The Relationship between Mechanical Aptitude and Proficiency
Tests for Air Force Mechanics 


Major Thomas L. Wood 
Standards Branch, Hq., USAF, Washington, D. C. 


In September, 1952, the United States Air 
Force installed a world wide proficiency test- 
ing program to assist classification officers in 
selecting airmen qualified for advancement to 
higher skill levels. The program included use 
of written tests, custom built by professional 
test technicians, to cover over 200 specific 
Air Force jobs. 

Development of the tests was carried out 
at Headquarters Air Materiel Command 
(Wright-Patterson AFB), Headquarters Air 
Training Command (Scott AFB), and Head- 
quarters Continental Air Command (Mitchel 
AFB).¹ Special units, under the direction of
officers with psychological training, were or- 
ganized to write the tests, score the answer 
sheets, and provide continuous statistical 
analysis to improve successive forms of each 
test. Subject matter specialists were selected 
by the major field commands of the Air Force 
from master sergeants with wide experience 
in or supervising the specific jobs for which 
the tests were built. On the average, five 
master sergeants worked with a professional 
test development technician in writing each 
test. Specialty descriptions of the Airman 
Career Program were used as guides in build- 
ing test outlines and in weighting the task 
elements of each job. 


Problem 


Several Air Force studies in the past have
shown the relationship between aptitude
scores and success in training to be significant
(3, 4). Brown and Ghiselli, in a summary
of the findings of research studies since
1919 concerning the predictive power of tests
of intelligence, speed of perception, and spatial
and motor aptitudes, found aptitude tests
to be very useful in predicting training success
(1). However, when the aptitude tests
are related to job proficiency measures such
as speed and amount of production, achievement
tests, and supervisor’s ratings, the predictive
power drops considerably.

1 These three units were consolidated into the
2200th Test Squadron, Mitchel AFB, Long Island,
N. Y., in April 1953.

Since the Air Force had never used job 
proficiency tests as qualification standards be- 
fore 1952, it was considered necessary to de- 
termine the relationship between the new 
proficiency tests developed by airman spe- 
cialists and the Airman Classification Bat- 
tery (ACB) (2). Aptitude scores from the 
ACB are used at Air Force Military Train- 
ing Wings to determine the initial classifica- 
tion and assignment of airmen to technical 
training courses. 


Rationale 


Since the proficiency tests were being used 
to measure mandatory job knowledge mini- 
mums requisite to award of higher skills, 
they were considered to be a practical cri- 
terion of knowledge needed to be successful 
on the job. An airman who fails to acquire 
the required minimum knowledge of his job 
is restricted in skill advancement and, conse- 
quently, in promotion to higher rank. Pass- 
ing the appropriate proficiency test then be- 
comes one objective index of success on the 
job. 

During September, 1952, 9,234 airmen were
tested on the senior aircraft mechanic’s test.
A random sample of 461 cases was selected
to study the relationship between aptitude
scores and proficiency scores. The data were
divided into four cells on the basis of pass-fail ²
on the proficiency test and qualified-not
qualified on the mechanical aptitude test.

It will be noted in Table 1 that 36 (8%) 
of the airmen failed to attain a qualifying 
score on the aptitude test, while 115 (25%) 
of the airmen tested failed the proficiency 
test. Of those failing the aptitude test, 33 
(92%) also failed the mechanic’s proficiency 
test. 


Table 1

Relationship between Performance of Aircraft Mechanics
on Mechanical Aptitude and Proficiency Tests

                                  Proficiency Test
Mechanical Aptitude Test     Failed       Passed       Total

Passed                       82 (18%)     343 (74%)    425 (92%)
Failed                       33 (7%)      3 (1%)       36 (8%)

Total                        115 (25%)    346 (75%)    461

r = .61*

Source: Air Force sample, N = 461, tested September 1952.

* Significant at the 1% level of confidence.


The Pearson r between the two tests was 
found to be .61, significant at the 1% level 
of confidence. 
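The cell percentages in Table 1 can be recomputed from the four counts, and a fourfold (phi) coefficient gives a rough check on the direction of the association. The sketch below is illustrative only: the function and argument names are hypothetical, and phi computed from the pass-fail split is not the Pearson r of .61 reported in the text, which was presumably computed from the test scores themselves.

```python
from math import sqrt


def summarize_fourfold(pass_pass, pass_fail, fail_pass, fail_fail):
    """Summarize a 2 x 2 table of counts.

    Argument names give the aptitude outcome first and the
    proficiency outcome second (hypothetical naming).
    Returns the sample size, each cell as a whole-number
    percentage of the sample, and the phi coefficient.
    """
    cells = {
        "pass_pass": pass_pass,
        "pass_fail": pass_fail,
        "fail_pass": fail_pass,
        "fail_fail": fail_fail,
    }
    n = sum(cells.values())
    percents = {name: round(100 * count / n) for name, count in cells.items()}

    # Marginal totals for each test.
    apt_pass = pass_pass + pass_fail
    apt_fail = fail_pass + fail_fail
    prof_pass = pass_pass + fail_pass
    prof_fail = pass_fail + fail_fail

    # Positive phi means passing one test goes with passing the other.
    phi = (pass_pass * fail_fail - pass_fail * fail_pass) / sqrt(
        apt_pass * apt_fail * prof_pass * prof_fail
    )
    return n, percents, phi


# Counts from Table 1 (aircraft mechanics).
n, percents, phi = summarize_fourfold(
    pass_pass=343, pass_fail=82, fail_pass=3, fail_fail=33
)
print(n, percents, round(phi, 2))
```

A coefficient based on dichotomized scores is generally lower than one based on the full score range, so the two figures should not be expected to agree.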

From an Air Force population of 2,426 air- 
men tested in November, 1952 on the senior 
vehicle mechanics’ proficiency test, a random 
sample of 303 airmen was selected. 

As indicated in Table 2, 23 of 39 (59%)
airmen who were below standards in mechanical
aptitude passed the proficiency test,
while only 16 (41%) failed the proficiency
exam. In this case it is obvious that mechanical
aptitude is not so highly related to
the ability to pass a proficiency test custom-built
to fit the job. The Pearson r in this
case was .35, significant at the 1% level of
confidence.


Table 2

Relationship between Performance of Vehicle Mechanics
on Mechanical Aptitude and Proficiency Tests

                                  Proficiency Test
Mechanical Aptitude Test     Failed       Passed       Total

Passed                       26 (9%)      238 (79%)    264 (88%)
Failed                       16 (5%)      23 (7%)      39

Total                        42 (14%)     261 (86%)    303

r = .35*

Source: Air Force sample, N = 303, tested November 1952.

* Significant at the 1% level of confidence.


2 Passing on the Proficiency Tests was established
at a Standard Score of 80, based on the total Air
Force population tested (Standard Score distribution
mean 100, std. dev. 20). Aptitude minimum
scores are established for each Career Field as a result
of research conducted by the Human Resources
Research Center to determine aptitude scores predictive
of success in technical training courses (5).
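Footnote 2 sets the passing mark at a Standard Score of 80 on a scale with mean 100 and standard deviation 20, that is, one standard deviation below the mean of the population tested. A minimal sketch of that scale follows. The linear conversion, the raw-score mean and standard deviation in the example, the treatment of a score of exactly 80 as passing, and the normal-curve assumption are all illustrative assumptions, not details given in the source.

```python
from statistics import NormalDist

SCALE_MEAN = 100.0     # Standard Score mean (footnote 2)
SCALE_SD = 20.0        # Standard Score standard deviation (footnote 2)
PASSING_SCORE = 80.0   # passing mark (footnote 2)


def to_standard_score(raw, population_mean, population_sd):
    """Convert a raw score to the Standard Score scale.

    Assumes a simple linear (z-score) conversion; the source
    does not describe the conversion actually used.
    """
    z = (raw - population_mean) / population_sd
    return SCALE_MEAN + SCALE_SD * z


def passes(standard_score):
    """Assumes a score exactly at the passing mark counts as a pass."""
    return standard_score >= PASSING_SCORE


# Hypothetical raw-score distribution: mean 60, SD 12.
example = to_standard_score(48, population_mean=60, population_sd=12)
print(example, passes(example))

# Share of a normally distributed population falling below the mark.
share_below_cut = NormalDist(SCALE_MEAN, SCALE_SD).cdf(PASSING_SCORE)
print(round(share_below_cut, 3))
```

Under the normality assumption roughly one examinee in six would fall below the mark in the full population; the samples reported in the tables need not match that figure.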

A further random sample of 189 airmen
was drawn from 1,079 senior weapons mechanics
tested in November 1952. Table 3
shows that 18 men (10%) were below the
minimum in mechanical aptitude. Of these
only seven (39%) passed the proficiency test,
while 11 (61%) failed the proficiency
test. The Pearson r was .35, significant at
the 1% level of confidence.


Table 3

Relationship between Performance of Weapons Mechanics
on Mechanical Aptitude and Proficiency Tests

                                  Proficiency Test
Mechanical Aptitude Test     Failed       Passed       Total

Passed                       36 (19%)     135 (71%)    171 (90%)
Failed                       11 (6%)      7 (4%)       18 (10%)

Total                        47 (25%)     142 (75%)    189

r = .35*

Source: Air Force sample, N = 189, tested November 1952.

* Significant at the 1% level of confidence.


Summary and Conclusions 


Table 4 summarizes the data concerning 
aircraft mechanics, vehicle mechanics, and 
weapons mechanics. When each group is di- 
vided into a dichotomy of high aptitude and 
low aptitude, it can be readily seen that the 
failure rates for the low aptitude men are 
much higher on the appropriate proficiency 
test. 


Table 4

Failure Rates on Proficiency Tests for High
and Low Aptitude Mechanics

                                 Total      N         Fail
Group                Aptitude    N          Failed    Rate

Aircraft Mechanics   High        425        82        19%
                     Low         36         33        92%

Vehicle Mechanics    High        264        26        10%
                     Low         39         16        41%

Weapons Mechanics    High        171        36        21%
                     Low         18         11        61%

Total                High        860        144       17%
                     Low         93         60        65%

Source: Air Force sample, N = 953, tested September
and November 1952.
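The failure rates in Table 4 follow directly from the counts, and the pooled rows are sums over the three groups. The short check below recomputes them; the variable and function names are hypothetical, and the Weapons high-aptitude total of 171 is taken from the row total in Table 3.

```python
# Counts from Table 4 as (total N, N failed).
# The Weapons "High" total of 171 is the Table 3 row total.
GROUPS = {
    "Aircraft": {"High": (425, 82), "Low": (36, 33)},
    "Vehicle": {"High": (264, 26), "Low": (39, 16)},
    "Weapons": {"High": (171, 36), "Low": (18, 11)},
}


def fail_rate(total, failed):
    """Failure rate as a whole-number percentage."""
    return round(100 * failed / total)


def pooled(level):
    """Sum totals and failures over the three groups for one aptitude level."""
    total = sum(rows[level][0] for rows in GROUPS.values())
    failed = sum(rows[level][1] for rows in GROUPS.values())
    return total, failed


for name, rows in GROUPS.items():
    for level, (total, failed) in rows.items():
        print(f"{name:9s}{level:5s}{total:5d}{failed:5d}{fail_rate(total, failed):5d}%")

for level in ("High", "Low"):
    total, failed = pooled(level)
    print(f"{'Total':9s}{level:5s}{total:5d}{failed:5d}{fail_rate(total, failed):5d}%")
```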

Since not all airmen attend technical schools
where they can be evaluated shortly after 
they receive aptitude tests, it is important for 
the Air Force to have aptitude scores which 
have value in predicting not only success in 
training but also relative performance after 
experience on the job. 

From the data presented it would seem 
that present mechanical aptitude tests pre- 
dict future assimilation of job knowledge to 
a usable degree. 


Received November 9, 1953.


A Comparative Evaluation of Two Approaches to
Job-Knowledge Test Construction ¹


Harry M. Mason ²


University of Illinois 


Writers concerned with construction of per- 
sonnel tests (1, 2, 3, 4, 7) advise somewhat 
different procedures for selecting and editing 
job-knowledge or trade test contents prior to 
tryout, but evidence that alternative ap- 
proaches result in tests having different rela- 
tionships to criteria is lacking. Experience 
in editing test content assembled by teams of 
expert workers under the guidance of test 
technicians suggested to the writer that two 
general approaches to the item writing task 
are used, one of which may be called the job- 
requirements approach and the other the job- 
experience approach. It seemed likely that 
the test items produced through these two 
approaches would exhibit corresponding dif- 
ferences in their relationships to criteria of 
job success. 

The present study reports an empirical try- 
out of three newly constructed tests of job 
knowledge applicable to airplane and engine 
mechanics maintaining piston-engined air- 
craft, and three existing tests from the Air- 
man Classification Battery. Two of the new 
tests were constructed in accordance with the 
job-experience approach to test construction, 
and one in accordance with the job-require- 
ments approach. The existing tests were as- 
signed to the two approaches after examina- 
tion of their contents. After preliminary 
studies of the new tests with Air Force in- 
ductees and airplane and engine mechanic 
trainees, all six tests were administered to a 
group of working airman airplane and engine 
mechanics at an Air Force Base. Results 
have been evaluated to show the degree to 
which tests assigned to each approach are
consistent in their relationship to criteria, and
differences between the patterns of relationships
to criteria characteristic of tests assigned
to the two approaches.


1 This research was supported in part by the
United States Air Force under Contract No. AF
33(038)-25726, monitored by the Human Resources
Research Center. Permission is granted for reproduction,
translation, publication, use and disposal in
whole and in part by or for the United States Government.

2 The writer wishes to acknowledge help and criticism
given by Prof. L. H. Lanier.


The Two Approaches 


The job-requirements approach strives to meas- 
ure mastery of formally stated job requirements. 
Tests resulting from this approach are essentially 
examinations over training courses, job handbooks 
and similar materials. The job-requirements ap- 
proach seems likely to dominate production of 
test items whenever rapid production or revision 
of tests is required, since it allows the item writer 
to capitalize upon the organization already exist- 
ing in published materials. 

The job-experience approach to test construc- 
tion attempts to measure mastery of one or more 
topics representing distinctive learning opportuni- 
ties afforded by a job. Test items may relate to 
what the worker does, to what he may expect to 
happen on the job, or to knowledge resulting 
from job aspects having no suspected import- 
ance. Since this approach requires the test con- 
structor to select and organize learning oppor- 
tunities offered by the job, it is difficult to use 
when tests must be produced quickly. 

The two approaches differ in underlying as- 
sumptions. The job-requirements approach as- 
sumes that workers differ in the degree to which 
they meet “minimum” job requirements, and 
that any excess over the minimum level of this 
type of knowledge results in enhanced job per- 
formance. The job-experience approach assumes 
that all workers retained on the job exhibit 
above-minimum job knowledge, but that the 
quality of the worker’s adjustment to the job is 
best reflected in the use he makes of learning 
opportunities the job affords. 

If job requirements were completely realistic,
artificial restrictions upon entry to the job and 
temptations to leave it negligible, and training 
for job advancement completely relevant, tests 
resulting from the two approaches might be 
highly similar. In anything less than the ideal 
condition, however, differences in tests produced 
through the two approaches would be expected. 
In the following paragraphs, tests assigned to the 
job-requirements approach are called require- 
ments centered, and tests assigned to the job- 
experience approach are called experience cen- 
tered. 
Requirements Centered Tests. Alternate forms
of the Electrical Information and Mechanical
Principles Tests of the Airman Classification
Battery were used.³ They were assigned to the
requirements-centered category because their content
and the descriptions given by Gragg and
Gordon (5) indicate that they are concerned
with principles taught in schools which prepare
men to meet requirements for entry into mechanical
jobs. The Electrical Information Test
contains 30 four-choice items concerned with
circuit diagrams and electrical principles. The
Mechanical Principles Test presents 15 picture-type
items relating to machines encountered in
everyday life.

The Training Research Laboratory (TRL)
Aviation Mechanics Technical Knowledge Test
was constructed for the present study. It consists
of 60 four-choice items chosen from data
of a long-range study ⁴ employing 300 job-knowledge
test items obtained from Air Force and
Navy aviation mechanics schools. Items selected
were the best discriminators between airman
trainees in early and late phases of technical
schools which are prerequisites for entry
into the apprentice level of the Air Force job of
Airplane and Engine Mechanic. Most of the
items relate to the operation or malfunctioning
of aircraft components.

Experience Centered Tests. An alternate form 
of the Aviation Information Test of the Airman 
Classification Battery was assigned to the experi- 
ence-centered category. It is intended to meas- 
ure inductees’ attempts to gain contact with 
aviation work. The test contains 30 four-choice 
items. 

The TRL Aviation Information Test was constructed
from information contained in Jane (6).
It has 30 five-choice items relating to relative
cruising speeds and other performance characteristics
of well known civilian and military
airplanes, names of manufacturers of airplane
equipment, and the equipment and components
employed in different airplanes. Its content is
intended to be continuous in type with that in the
Airman Classification Battery Aviation Information
Test; it refers to information more readily
available to aviation workers than to men outside
aviation jobs.

The two aviation information tests were as- 
signed to the experience-centered category be- 
cause their content may be learned through ex- 
perience on the job, rather than through attempts 
to meet formal job requirements through schools. 


3 Air Force tests were made available by the Per- 
sonnel Research Laboratory, Human Resources Re- 
search Center, Lackland Air Force Base. Permission 
to employ these tests is gratefully acknowledged. 

4 The study is being conducted by Dr. E. L. Gaier
under the present Air Force contract. Grateful acknowledgment
is made for permission to use these
items.

The TRL Maintenance Techniques Test was
made up as a result of interviews with 50 airmen 
recommended as expert airplane and engine me- 
chanics. Each interview was guided to cover all 
major airplane systems on the aircraft with which 
the interviewee was most familiar. Statements 
of interviewees led directly to test item content, 
or to intuitive guesses concerning the verbal self- 
guidance employed by good mechanics in the 
tasks mentioned. The test contains 76 four- 
choice items. It emphasizes airframe rather than 
powerplant systems. Most of the items relate to 
operations performed by the mechanic, rather 
than to the mechanical operation of aircraft 
components. 

Administration of Tests. Slightly less than 
four hours were required for subjects to answer 
the entire battery of six tests and to give per- 
sonal information and peer ratings. Air Force 
tests were finished within time limits and were 
scored according to prescribed formulae; new 
tests were given without time limits and were 
scored for number of correct responses. 

Subjects. Subjects were 204 airplane and engine
mechanics chosen as every n-th name from
alphabetical rosters of all airplane and engine
mechanics at apprentice to supervisor or technician
levels in three airplane maintenance squadrons
at Lowry Air Force Base. In sampling, n
was chosen as three or four (every third or every
fourth man), to give as nearly 70 men per
squadron as possible.
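The sampling rule described for the subjects (every n-th name from an alphabetical roster, with n picked to yield about 70 men per squadron) can be sketched as follows. The function names, the starting position, and the roster sizes in the example are hypothetical; the source does not report squadron sizes or where in each roster the count began.

```python
def choose_interval(roster_size, target=70):
    """Pick the sampling interval n that gives roughly `target` names."""
    return max(1, round(roster_size / target))


def every_nth(roster, n, start=0):
    """Take every n-th name from the roster in alphabetical order.

    `start` is the zero-based position of the first name taken
    (an assumption; the source does not say where counting began).
    """
    ordered = sorted(roster)
    return ordered[start::n]


# Hypothetical squadron sizes.
for size in (220, 250):
    n = choose_interval(size)
    names = [f"airman_{i:03d}" for i in range(size)]
    sample = every_nth(names, n)
    print(size, n, len(sample))
```

With rosters of a few hundred names, an interval of three or four gives samples close to 70, consistent with the procedure described.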

Criteria. Personal data statements concerning 
the length of aircraft maintenance experience 
were the principal criteria employed. In treat- 
ment of results, men claiming six years or more 
of experience are called high-experience men, 
those with less than six years of experience are 
called low-experience men. The distribution of
experience, with a major mode at less than two
years and a minor mode at more than eight years,
made a breakdown at this point convenient. Division
at this point is also in accord with the
presumption that the low-experience group is
composed primarily of men who have not yet had
time to become fully competent, and that the
high-experience men have, in general, met whatever
effective minimum job requirements exist.
High-experience men averaged 10.4 years, low- 
experience men 1.7 years of experience. 

Peer ratings were also employed. Each mechanic
ranked for competence the six men whose
work habits he knew most thoroughly. Men
ranked 3.0 or less, on the average, by four or
more peers are called “Good”; those ranked 3.1
or more on the average are called “Poor”; men
not ranked by as many as four peers are regarded
as “Not Rated.” After inspection of test
results it was seen that at each experience level,
the subgroup rated Poor was different from
others, but that subgroups rated Good or Not
Rated had essentially the same mean scores on
all tests, there being no significant mean differences.
Consequently, in treatment of results, a
subgroup rated Poor is differentiated from one
designated as Others at each experience level.


Table 1

Mean Scores of Airplane and Engine Mechanic Criterion Subgroups

                                                 Criterion Subgroup*
                                     High-Exper.  High-Exper.  Low-Exper.  Low-Exper.
                                     “Poor”       “Other”      “Poor”      “Other”
Test                                 (N = 5)      (N = 50)     (N = 51)    (N = 98)

Experience Centered:
  TRL Aviation Information**         14.1         20.7         16.2        15.3
  Air Force Aviation Information**   19.3         24.2         21.6        21.1
  TRL Maintenance Techniques**       38.9         46.2         41.1        39.4

Requirements Centered:
  TRL Technical Knowledge***         26.1         34.5         30.2        34.6
  Air Force Electrical Information                21.0         19.9        20.7
  Air Force Mechanical Principles                 11.0         11.4        11.4

* Criteria employed were amount of aircraft maintenance experience and peer ratings. See text for method
used to establish subgroups.
** Test separates High-experience “Other” subgroup from all three remaining groups by a highly significant
difference (1 per cent level).
*** Test separates either “Poor” group from all three remaining subgroups by highly significant differences
(1 per cent level).
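The peer-rating rule described in the text (mean rank of 3.0 or less from at least four peers is “Good”, 3.1 or more is “Poor”, fewer than four raters is “Not Rated”) can be written as a small function. This is a sketch under stated assumptions: the function name is hypothetical, and rounding the mean rank to one decimal before comparing is an assumption made here to cover means that fall between 3.0 and 3.1, which the text does not address.

```python
def peer_rating(ranks, min_raters=4, cutoff=3.0):
    """Classify a mechanic from the ranks his peers gave him.

    `ranks` holds one rank (1 = most competent of six) per peer
    who listed the man. The mean is rounded to one decimal before
    comparison with the cutoff (an assumption, not from the source).
    """
    if len(ranks) < min_raters:
        return "Not Rated"
    mean_rank = round(sum(ranks) / len(ranks), 1)
    return "Good" if mean_rank <= cutoff else "Poor"


def criterion_subgroup(ranks):
    """Collapse Good and Not Rated into Other, as done for the analysis."""
    return "Poor" if peer_rating(ranks) == "Poor" else "Other"


print(peer_rating([1, 2, 3]))        # too few raters
print(peer_rating([1, 2, 3, 4]))     # mean rank 2.5
print(peer_rating([6, 6, 5, 6]))     # mean rank well above 3
print(criterion_subgroup([1, 2, 3]))
```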

The presumption that the high-experience men 
are the more competent rests on the assumption 
that they were at least as able to benefit from 
experience at the outsets of their careers as were 
the low-experience men. High-experience men 
are, as expected, older, 80 per cent being over 
30, while less than 7 per cent of the low-experi- 
ence men are more than 30 years of age. High- 
and low-experience subgroups alike have 76 per 
cent who completed high school. Though the 
older men went to school when educational re- 
quirements were lower and thus might have been 
a more select group, they are also more likely 
to have achieved their high school graduation 
through GED examinations. Thus there is no 
strong evidence of any basic intellectual differ- 
ence between the two experience levels. Nearly 
half of the low-experience men had had all their 
maintenance experience on one type of airplane. 
None of the high-experience men reported fa- 
miliarity with less than two types of airplanes. 
One eleventh of the high-experience men, and 
one-third of the low-experience men were rated 
Poor in the peer ratings. 

Aptitude Indexes were available for only 117 
men, all in the low-experience category. The 
mean Mechanical Aptitude Index was 6.8, SD 
1.33; the mean Technician Specialty Index, com- 
parable to a group intelligence test score, was 
6.4, SD 1.88. Neither of these shows a high de- 
gree of selection, since Gragg and Gordon (5) 
found means for both Indexes to be 6.2 in 1,000 
presumably unselected inductees in 1949. Air
Force Specialty Codes assigned the men ranged
from the “3” to the “7” level, but these were not
employed as criteria, since they had not all been
assigned through the present uniform procedure.

Among the low-experience men, the peer rating 
of Poor does not appear to be associated with 
low Mechanical Aptitude, the proportions of 36 
mechanics rated Poor and 81 rated Other having 
Mechanical Aptitude Indexes of 7 or higher be- 
ing 53 and 55 per cent, respectively. The pro- 
portion of Poor having Technician Specialty In- 
dexes of 7 or higher was 33 per cent, while that 
for Others was 48 per cent. This difference is 
nearly significant. 


Results 


Preliminary studies had shown that all tests
distinguish significantly between groups of inductees
and the working mechanic group.⁵
In the present study, means of criterion
subgroups are compared to determine (a)
whether or not a common pattern of subgroup
mean differences characterizes the tests
assigned to each major approach; and (b)
whether or not the pattern of criterion subgroup
means produced by tests assigned to
one approach differs meaningfully from the
pattern produced by the tests assigned to the
other. Brief consideration is given to correlations
between tests, and between tests and
Aptitude Indexes.


5 In preliminary studies, 108 airman inductees and
206 airplane and engine mechanic trainees at the
end of training preparatory to the apprentice level
of the job were tested. Scores made by these groups
indicated that all newly constructed tests differentiate
significantly between inductees and trainees, and
between trainees and the mechanics tested in the
present study. Findings of Gragg and Gordon (5)
compared with the mean scores on Air Force tests
used in the present study indicate that the Air Force
tests likewise distinguish working mechanics from
inductees.

As shown in Table 1, all of the experience- 
centered tests show the same rank order of 
subgroup means. High-experience mechan- 
ics rated “Other” have highest scores, fol- 
lowed by low-experience mechanics rated 
Poor, low-experience Other and high-experi- 
ence Poor. For each test the difference be- 
tween the high-experience Other and remain- 
ing subgroups is significant beyond the one 
per cent level, while differences separating the 
remaining subgroups one from the other are 
not significant. 

Requirements-centered tests do not present 
a completely consistent pattern of subgroup 
means, but if the Mechanical Principles Test, 
which is relatively short and unreliable, is ex- 
cluded, and fractional mean differences are 
ignored, a consistent pattern is evident. Both 
the Technical Knowledge and the Electrical 
Information tests place the low-experience 
Other subgroup and the high-experience Other 
subgroups first, low-experience Poor next, and
high-experience Poor last. Both of the sub- 
groups rated Poor are reliably lower in mean 
score than either of the Other subgroups on 
the Technical Knowledge Test. No signifi- 
cant differences were produced by the other 
two requirements-centered tests. 

The high-experience Poor subgroup (N = 5) 
occupies bottom position in all tests; it there- 
fore does not assist in differentiating between 
the two approaches. The most probable ex- 
planation for the low test scores of this sub- 
group is that it represents a recognized mi- 
nority of poor mechanics who manage to keep 
their association with the job and the Air 
Force in spite of limited ability or poor mo- 
tivation. Three of the five men were ranked 
last unanimously by all mechanics who listed 
them among their closest working acquaint- 
ances. The characteristics of tests adequate 
to isolate members of this subgroup do not 
appear to be critical. 

Experience-centered tests discriminate
unambiguously in favor of high-experience
mechanics generally approved by their peers.
Requirements-centered tests, to the extent 
that they discriminate among the subgroups, 
tend to isolate the subgroups rated Poor, but 
do not distinguish knowledge presumably 
learned primarily through experience. If the 
experience-centered tests measured nothing 
except length of tenure on the job, they 
would have little practical utility. To de- 
termine whether or not some mechanics gain 
this knowledge quickly, a check was made to 
see how many of the low-experience mechan- 
ics scored above the mean of the high-experi- 
ence Other subgroup, and to examine their 
personal data and rating characteristics. A 
total of 28 low-experience mechanics for 
whom aptitude indexes were available scored 
above the mean of the high-experience Other 
subgroup on either the TRL Aviation Information
Test or the TRL Maintenance Techniques
Test, or both. The breadth of experience
represented was substantially the same
as that of other low-experience men. Slightly,
but not significantly, more of the 28 men
were rated Poor than among low-experience
men generally. The group was outstandingly
high on the Technical Knowledge Test, 21 
of the 28 having scores of 35 or higher. All 
but two of these men had both Mechanical 
and Technician Specialty Aptitude Indexes of 
6 or higher, their mean Mechanical Index be- 
ing 8.0 and their mean Technician Specialty 
Index being 7.6. Both means are signifi- 
cantly higher than the means of the remain- 
der. For 11 of the 28 men, Mechanical Index 
was higher, for 15, both were the same, and 
for two, the Technician Specialty Index was 
the higher. Thus it appears that experience- 
centered knowledge may be mastered early in 
an airman’s career, and that high aptitudes 
are indicative of ability to master it. On the 
other hand, these men’s peers are not inclined 
to regard them as outstanding mechanics. 
Low-experience men rated Poor have a 
slight tendency to score above other low-ex- 
perience mechanics on experience-centered 
tests. As indicated earlier, the Mechanical 
Aptitudes of the low-experience mechanics 
rated Poor are substantially equal to those of 
the Others, but their Technician Specialty
Indexes are nearly significantly lower. The
low status of Poor subgroups on the Tech- 
nical Knowledge Test may be due to this 
test’s higher relationship to general intelli- 
gence, indicated by its correlation with apti- 
tudes shown in Table 2. The low-experience 
Poor subgroup’s better status on the experience-centered
tests could not be due to better
measured mechanical aptitude. It might pos- 
sibly be due to more realistic attitudes toward 
the job, resulting in effective but not spec- 
tacular job adjustment. Since all three dif- 
ferences are small and non-significant, they 
could be due to chance. 

Table 2 gives means and SDs of tests, split- 
half reliabilities, average correlation of each 
test with other tests, and correlation of each 
test with Mechanical Aptitude Index and 
Technician Specialty Index, for the 117 low- 
experience mechanics for whom aptitude in- 
dexes were available. Correlations presented 
indicate the TRL Aviation Information Test 
to be more independent of other tests and 
aptitude indexes than any other test used. 
The Air Force Aviation Information Test has 
correlations more nearly like those of the re- 
quirements-centered tests. The Air Force 
test is somewhat too easy for working me- 
chanics, having been designed for inductees; 
this test’s ability to distinguish between working
mechanics may be based to a considerable
extent upon differences in reading ability,
rather than aviation information. On the 
whole, the experience-centered tests are less 
highly correlated with aptitude indexes than 
are the requirements-centered tests. Since 
these tests are related to criteria among both 
low-experience and high-experience mechan- 
ics, it would appear to be worthwhile to ex- 
periment with a mechanical aptitude index 
depending somewhat more upon tests of 
knowledge spontaneously acquired before en- 
tering the Air Force. It should be borne in 
mind, however, that the discussion relating 
to correlations is based only upon differences 
in their size, without regard for significance 
of these differences. More studies are needed 
before a firm interpretation of these relation- 
ships is attempted. 

A limitation applying to all results of the 
present study is that the study is cross sec- 
tional; differences between mature and less- 
experienced mechanics could possibly be due 
to selective attrition. This is not likely, con- 
sidering the personal data differences between 
the two experience levels, but only longitudinal
studies, employing broad batteries of
tests, can establish surely the changes in job-knowledge
which differentiate between mature
workers who have demonstrated satisfactory
adjustment to the job and beginners
whose aptitudes are known only in terms of
test scores.


Table 2

Test Score Distribution Characteristics (N = 204)

                                                        Correlations†
                                                  Ave. with
                                                  Other      Mech.    Tech. Sp.
Test                           Mean   SD    Rel.*†  Tests      Apt.**   Aptitude**

Requirements Centered:
  TRL Technical Knowledge      33.2   7.2   81      48         50       42
  AF Electrical Information    20.5   5.0   75      43         58       42
  AF Mechanical Principles     11.3   2.5   50      29         56       61

Experience Centered:
  AF Aviation Information      22.0   4.5   71      38         51       48
  TRL Aviation Information     16.7   5.1   74      30         23       34
  TRL Maintenance Techniques   41.3   7.4   72      40         58       37

* Odd-even reliability, corrected by Spearman Brown Formula.
† Decimal points are omitted from correlation coefficients.
** Correlations with Aptitude Indexes are for 117 low-experience mechanics for whom Aptitude Indexes were
available.
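The reliabilities in Table 2 are described as odd-even correlations corrected by the Spearman-Brown formula. For a test split into two halves, the standard form of that correction is r_full = 2 r_half / (1 + r_half). The sketch below applies it and inverts it; the function names are hypothetical, and the half-test correlations themselves are not reported in the source.

```python
def spearman_brown(half_test_r, length_factor=2.0):
    """Project reliability for a test `length_factor` times as long.

    With length_factor = 2 this is the usual split-half correction.
    """
    return length_factor * half_test_r / (1.0 + (length_factor - 1.0) * half_test_r)


def implied_half_test_r(full_test_r):
    """Invert the split-half correction: the odd-even correlation
    that would produce the given corrected reliability."""
    return full_test_r / (2.0 - full_test_r)


# Corrected reliability reported for the TRL Technical Knowledge Test.
reported = 0.81
half_r = implied_half_test_r(reported)
print(round(half_r, 2), round(spearman_brown(half_r), 2))
```

On this reading, the reported .81 for the Technical Knowledge Test corresponds to an odd-even correlation of about .68 before correction.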

The most probable explanation for the dif- 
ferent relationship to criteria shown by ex- 
perience-centered and requirements-centered 
job-knowledge tests is that requirements-cen- 
tered tests emphasize formal requirements 
some of which are not functionally effective, 
and that for the proven formal requirements 
measurable with tests, critical levels of mas- 
tery have not been incorporated into test ma- 
terials. Since some discrepancies between 
stated requirements and learning opportuni- 
ties accepted by good workers on the job may 
be expected always to exist, there appears to 
be a continuing need for empirical studies 
aimed at improving the coverage of job 
knowledge tests. 


Received December 15, 1953.


Judgments of Performance in Process ¹


In work sample performance testing, judgments
are often made by the test administrator
regarding the manner in which an examinee
performs the components of a specific
task. For instance, in a Drill Point Grinding 
Performance Test, the test administrator is 
asked to make judgments on the manner in 
which the examinee holds the drill while 
grinding, on whether the examinee wears 
loose clothing that could snag in the grinding 
wheel, on whether the examinee inspects the 
grinding wheel for cracks prior to grinding, 
on whether the examinee oscillates the drill 
while grinding, etc. One of the problems
connected with this type of judgment is that 
the perceptions of the test administrators 
themselves may vary from time to time and 
thus may represent an uncontrolled variable 
in work sample performance testing. Cloth- 
ing may be perceived on one day as being 
loose enough to snag in a grinding wheel 
while two weeks later the same clothing, 
worn in the same manner, may be perceived 
as being perfectly safe. Of course, one way 
in which this type of variation may be par- 
tially controlled is to keep the items to be 
judged gross enough and objective enough so 
that misperception is minimized. If a defi- 
nite frame of reference is written into each 
element scored, and if the observations re- 
quired are kept gross, then the danger of 
perceptual variability in examiners may be 
minimized. 

¹ The data herein reported are a small portion of
the data gathered under Contract Nonr-872(00) between
the Institute for Research in Human Relations
and the Office of Naval Research. The opinions
expressed are those of the author and do not
necessarily represent the opinions of the Office of Naval
Research or of the naval service.

It is still incumbent on the test user to
ascertain the reliability of the observations of
the people who act as test administrators.
The ideal method for determining the consistency
of an individual examiner is the
situation in which the examinee’s performance
is held constant over two separate occasions
and the examiners’ perceptions allowed to 
vary. Since the stimulus configuration re- 
mains constant, any unreliability shown can 
then be attributed to variation within the ex- 
aminer. However, unfortunately, no one can 
possibly perform the same job in exactly the 
same manner on two separate occasions. One 
method by which performance may be held 
constant is to take a motion picture of the 
examinee performing the job. The motion 
picture may then be shown on two separate 
occasions and the examiner asked to score the 
motion picture rather than to score the actual 
live performance. Thus the stimulus situa- 
tion is held constant over the two time in- 
tervals and any variation shown may be at- 
tributed to variation within the examiner. 
Two assumptions of this method are that the 
movie situation presents the same stimulus 
configuration to the examiner as does the 
actual work sample performance test situa- 
tion and that the examiner scores the movie 
in the same manner as he would score an 
actual work sample performance test. Fur- 
ther assumptions are the usual assumptions 
made in any test-retest reliability check. The 
principal disadvantage of the motion picture 
technique is that movies are difficult and ex- 
pensive to produce. This is especially true 
for long, involved jobs. 

The purpose of this paper is to present our 
method of using the movie technique for de- 
termining intra-examiner reliability and to 
present the intra-examiner reliabilities ob- 
tained from a small group of test adminis- 
trators. 


Method 


A 16 mm. black-and-white movie was made of a
Naval Aviation Structural Mechanic taking a
Drill Point Grinding Work Sample Performance
Test. The film was unrehearsed and the only
instructions given the subject, a randomly selected
Aviation Structural Mechanic, were “to grind the
drill as he would ordinarily do it.” S was told
that we were going to take movies of him while
he was working. The motion picture cameras
and lights were not hidden, but their presence
and the knowledge that his behavior was being
photographed did not seem to affect S's behavior.

The motion picture was first shown in the
training room of VC-4, Naval Air Station,
Atlantic City, to five Chief Aviation Structural
Mechanics. These chiefs had previous experience
in work sample performance test administration
and were moderately well informed in the
general principles of work sample performance
test administration. The movie was reshown to
the same chiefs one month after its first
administration. One month is usually accepted as
a sufficient time interval for forgetting of original
responses. Moreover, the chiefs did not
know that they would be asked to make exactly
the same observations on two separate occasions.
Therefore, there was little reason for them to
try to remember their original responses.

The chiefs were asked to fill in the Movie
Evaluation Form (see Figure 1) during each
showing of the motion picture. The Movie
Evaluation Form contained fourteen items such
as: “Did the examinee check the tool rest for
proper distance from the periphery of the grinding
wheel?”; “Did the examinee ever adjust the
tool rest while the grinding wheel was in motion?”;
“Did the examinee wear loose clothing or
clothing that could snag in the grinding wheel?”;
“Did the examinee check the shank of the drill
for bends and burns?”; etc. Sufficient light was
allowed in the “theater” so that the chiefs could
fill in the forms as the appropriate action was
performed. Thus the motion picture situation
was as close as possible to actually scoring a
work sample performance test.


Care and Use of Tools

 1. Did the examinee check the tool rest for proper distance from periphery of the grinding
    wheel?                                                                            Yes ___ No ___
 2. Did the examinee ever adjust the tool rest while the wheel was in motion?         Yes ___ No ___
 3. Did the examinee use a coolant while grinding the drill?                          Yes ___ No ___

Procedure

 4. Did the examinee read the “Examinee Instructions?”                                Yes ___ No ___
 5. While grinding, did the examinee oscillate the drill so that heel was moved along the
    surface of the grinding wheel?                                                    Yes ___ No ___
 6. Did the examinee hold the shank slightly lower than the point while grinding?     Yes ___ No ___
 7. Did the examinee alternate from flute to flute while grinding?                    Yes ___ No ___
 8. Did the examinee grind one flute and then the other?                              Yes ___ No ___
 9. Did the examinee check the shank of the drill for bends and burns?                Yes ___ No ___
10. Did the examinee secure the grinding wheel?                                       Yes ___ No ___
11. Did the examinee “police up” the work area when securing?                         Yes ___ No ___

Safety Precautions

12. Did the examinee wear eyeshields or goggles while grinding?                       Yes ___ No ___
13. Did the examinee tap the grinding wheel or check it for cracks prior to its use?  Yes ___ No ___
14. Did the examinee wear loose clothing or clothing that could snag in the grinding wheel?
                                                                                      Yes ___ No ___

Fig. 1. Movie evaluation form.


Results 


The results in terms of the consistency of 
the observations of the Chief Structural Me- 
chanics who viewed on two separate occasions 
the drill point grinding motion picture are 
presented in Table 1. 


Table 1

Intra-Examiner Consistency for Measurements of
Performance in Process

                Per Cent
Observer        Consistency

   1              85.6
   2              71.4
   3             100.0
   4              64.3
   5              92.8

Mean              82.8





In preparing Table 1, we called an observer
consistent on an item if he answered the item on the
second showing of the movie in exactly the
same manner that he did on the first showing.
Thus:

\[
\text{Intra-examiner consistency} =
\frac{\text{Number of items answered in exactly the same manner on each showing of motion picture}}
     {\text{Total number of items on questionnaire}} \times 100
\]

The grand mean for intra-examiner agreement
was 82.8% with a range from 64.3%
to 100%. This mean of 82.8% agreement
would usually be considered adequate if
converted into a correlation coefficient and
interpreted as correlation coefficients are usually
interpreted. Of course, these intra-examiner
reliability estimates are based on only one
motion picture. The danger of generalization
from one measure of the reliability of
observations of performance in process to all
observations of performance in process is
self-evident.

In view of the range shown, the desirability
of determining the reliability of the observations
of examiners prior to assigning them to
test administrative duties is also indicated.
If all examiners show low consistencies, then
either the examiner training has been poor or
the test itself is inadequate. Naturally, only
those examiners with high consistencies are
worthy of consideration as test administrators.

The problem of how high examiner consistency
must be before it is high enough
remains open. A second problem remaining
open is that of the effect of increasing the
number of judgments to be made on
intra-examiner reliability. That is, will the
Spearman-Brown prophecy formula hold in this
situation?
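
The per cent consistency index defined above, and the Spearman-Brown projection raised as an open question, can be sketched in a few lines of Python. This is only an illustration: the function names and the Yes/No answer lists are hypothetical, and the Spearman-Brown formula is stated for a reliability coefficient, so applying it to a per cent agreement figure would itself be an assumption.

```python
def percent_consistency(first_showing, second_showing):
    """Per cent of items answered identically on the two showings."""
    if len(first_showing) != len(second_showing) or not first_showing:
        raise ValueError("both showings must cover the same non-empty item list")
    same = sum(a == b for a, b in zip(first_showing, second_showing))
    return 100.0 * same / len(first_showing)


def spearman_brown(reliability, length_factor):
    """Projected reliability when a measure is lengthened by length_factor."""
    return length_factor * reliability / (1.0 + (length_factor - 1.0) * reliability)


# Hypothetical answers of one observer to the fourteen Yes/No items.
first = ["Yes"] * 14
second = ["Yes"] * 9 + ["No"] * 5   # 9 of 14 items answered the same way
print(round(percent_consistency(first, second), 1))  # 64.3
```

With fourteen items, each changed answer lowers the index by about seven percentage points, so the tabled values correspond to a small number of changed answers per observer.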


Summary 


A motion picture technique was described 
and the results of its use in determining the 
intra-examiner reliability for performance test 
administrators’ observations of performance 
in process were indicated. The mean intra- 
observer consistency for observations of ele- 
ments of a drill point grinding task on two 
separate showings of the movie was ade- 
quate. However, the range of consistency 
was great enough to warrant a recommenda- 
tion in support of careful investigation of the 
intra-examiner reliability prior to the ad- 
ministration of work sample performance 
tests. 


Received November 25, 1953. 





The Journal of Applied Psychology
Vol. 38, No. 6, 1954 


Influences on Merit Ratings 


Aaron J. Spector 


Officer Education Research Laboratory, Air Force Personnel and Training 
Research Center, Maxwell Air Force Base * 


Many sources of errors in merit ratings are 
well known to users of these devices. Labora- 
tory and field investigations have identified 
errors which may be classified as: (a) char- 
acteristic biases of classes of raters, e.g., 
men, women, peers, etc.; and (b) universal 
errors, e.g., halo effect, error of central tend- 
ency, etc. Somewhat neglected is the fact 
that the stimulus, the ratee’s behavior, may 
contribute errors which are not ordinarily 
considered. His total behavior is complex 
and includes some behaviors which are perti- 
nent and some which aren’t pertinent to the 
factors on which he is being rated. Evaluation
of the pertinent behaviors independently
of all others may require special training of 
the raters. This may be especially true when 
the factors being evaluated are in themselves 
complex and subjectively loaded, e.g., po- 
tentialities of the ratee, cooperativeness, 
quality of work, etc. Irrelevant character- 
istics may be so influential as to seriously 
bias the evaluations on the desired charac- 
teristics. The research presented here has 
been designed to investigate the effects of 
irrelevant ratee behaviors on ratings assigned 
to him. 

A ratee characteristic, which is irrelevant 
to the others being evaluated, has been ex- 
perimentally varied in order to measure its 
effects on the pertinent characteristics. The 
variable being manipulated is that of amen- 
ability to suggestions. This variable was se- 
lected because of the prevalence in industry 
of situations where suggestions may be ac- 
cepted or rejected by the ratee and may, 
therefore, influence the rater’s evaluation of
other characteristics. In order to complete
the experimental design a second variable,
the rater’s opportunity to make suggestions
to the ratee, was also manipulated.

* The author was a member of the faculty at the
University of Massachusetts when this study was
conducted. He wishes to express his gratitude to
his colleagues who contributed their class time to
this research, and to Mr. Churchill Morgan for the
preliminary analyses of the data.

¹ For a summary of the major studies, see (1).
Mahler’s (2) review is more comprehensive and
recent.


Procedures 


In five sections of a General Psychology 
course² a guest lecturer was introduced to
the class as a student who was interested in 
becoming a college teacher. The classes, 
ranging in size from 19 to 30 students, were 
advised that they would be asked to evaluate 
his teaching ability after he had lectured. In 
all classes he delivered the day’s lecture in 
exceedingly poor fashion, making several glar- 
ing pedagogical errors, although the material 
itself was adapted from a well known text- 
book. After the first 15 minute period, the 
experimental variable was introduced accord- 
ing to the plan shown in Table 1. Three of 
the groups (A, B, and C) wrote notes to the 
lecturer after the first 15 minutes, suggesting 
improvements to be made in his techniques. 
A second 15 minute lecture followed, which 
was as poor as the first. At the conclusion 
of this lecture the students evaluated the lec- 
turer using a rating scale described below. 

After looking over the notes in Group A 
the lecturer accepted the suggestions by 
thanking the students for them and express- 
ing his intention of modifying his techniques, 
as per their suggestions. In Group B he 
rejected their suggestions by telling them he 
had his own ideas on improvement. Although 
the students in Group C also wrote notes 
they were not submitted until the conclu- 
sion of the second 15 minute period of lec- 
ture. At this time they made their evalua- 
tions and then submitted their suggestions. 

² The subjects were sophomore students at the
University of Massachusetts. Sections of this course
were randomly assigned to the experimental
treatments.

³ The guest lecturer was trained for approximately
seven hours in order to insure that his delivery
would be comparable in all classes.


Table 1
Experimental Design

Treatment   15 minutes   10 minutes                15 minutes   10 minutes

A           lecture      suggestions written and   lecture      rating
                         accepted
B           lecture      suggestions written and   lecture      rating
                         not accepted
C           lecture      suggestions written but   lecture      rating
                         not submitted
D           lecture      no suggestions;           lecture      rating
                         announcement read
                         instead
E           lecture      no suggestions;
                         ratings made





Groups D and E were not given the opportunity
to suggest any changes to the ratee.
Instead of writing suggestions Group D listened
to an announcement read by the officially
assigned instructor; the amount of time
required for the announcement was roughly
equivalent to the time other groups used in
writing suggestions. Group E made no suggestions
and evaluated the lecturer after the first 15 minute
period.

The lecturer was evaluated on a rating
scale containing five questions measuring:
(1) manner; (2) ability; (3) knowledge;
(4) potential; and (5) poise. For each question
the individual subjects checked one of
seven boxes which were ordered on a continuum,
as illustrated by question 1, which
read, “Compared to others, this lecturer’s
manner while lecturing was: As poor as any
I’ve seen; Considerably worse than most;
Not quite as good as most; As good as most;
Somewhat better than most; Considerably
better than most; As good as any I’ve seen.”
The responses on each factor were weighted
0-6, higher scores being assigned to the more
favorable responses.


Results


The most favorable ratings on all five factors
were recorded by the acceptance group
(A), as shown in Table 2. The poorest ratings
were given by Group E, which made no
suggestions and had only 15 minutes of lecture.
The other no-suggestion group (D)
also rated the lecturer relatively unfavorably.
The mean ratings of the B and C groups were
equal, but higher than either D or E. It appears
that expression of criticism of the lecturer,
via written suggestions, resulted in
raters giving higher evaluations than when
the raters had no opportunity for this expression.
These results obtained when the rater’s
suggestions were not submitted to the ratee,
as well as when they were submitted and
accepted or rejected.


Table 2
Means and Standard Deviations of Ratings on Each Characteristic for Each Treatment

                               Questions
Treat-           1        2         3          4         5
ment           manner   ability  knowledge  potential  poise     N    Mean row  SD row

A       M       2.11     2.16     2.95       3.47      2.37     19     2.61      .86
        SD       .45      .59      .51        .88       .87
B       M       1.28     1.52     25         2.60      1.36     25     1.88     1.19
        SD       .17      .94      .93       1.10      1.30
C       M       1.30     1.64     2.25       2.57      2.04     28     1.88      .99
        SD       .70      .98      .95       1.01       .64
D       M       1.53     1.69     2.29       2.06      1.43     35     1.76      .92
        SD       .89      .69      .81        .89       .87
E       M       1.50     1.13     2.07       1.93      1.50     30     1.55     1.34
        SD      1.51      .85     1.18       1.26      1.54
SD col          1.02      .91      .98       1.12      1.15

The most favorable ratings, however, were 
consistently made by the group whose sug- 
gestions were accepted by the ratee. Appar- 
ently, amenability to suggestions or expressed 
intention of compliance with the suggestions, 
operated to bias the raters’ evaluations of the 
lecturer. 

The data were analyzed further by analysis 
of variance.⁴ An F ratio, obtained with total
scores⁵ of all subjects in each treatment,
indicated that the mean total scores were 
significantly influenced by the treatments ac- 
corded the groups (Table 3). 


Table 3

Analysis of Variance of Total Merit Rating Scores of all
Subjects in Five Experimental Treatments

Source of Variation      df       MS        F       p

Between treatments        4     2123.66    5.49    .01
Within treatments       132      386.51
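
The arithmetic behind Table 3 can be checked with a short sketch. It assumes that the two tabled values are the between- and within-treatment variance estimates whose ratio gives F, and it uses the group sizes reported in Table 2. The helper `one_way_anova` and the toy scores are hypothetical; the raw ratings are not given in the paper.

```python
def one_way_anova(groups):
    """Return (df_between, df_within, ms_between, ms_within, f_ratio)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n_total - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return df_between, df_within, ms_between, ms_within, ms_between / ms_within


# Degrees of freedom implied by the group sizes in Table 2.
sizes = [19, 25, 28, 35, 30]
df_between, df_within = len(sizes) - 1, sum(sizes) - len(sizes)   # 4 and 132

# Ratio of the two tabled variance estimates.
table3_f = 2123.66 / 386.51
print(df_between, df_within, round(table3_f, 2))  # 4 132 5.49

# Toy scores (hypothetical) to show the full computation from raw data.
toy = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
print(one_way_anova(toy))
```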





The ratings on each characteristic were 
then examined. F ratios indicated that the 
responses on four of the questions varied sig- 
nificantly between groups (Table 4). That 
is, the experimental treatments accorded to 
the groups differentially affected their ratings 
on four out of the five characteristics. 

The only Between Groups variance which 
was not significantly different from chance 
was on evaluation of the lecturer’s ability. If 
the students measured the lecturer’s ability 
by the amount they had learned or by the 
quantity of notes they could take, it is un- 
derstandable that their evaluations would 
agree since neither learning nor note taking 
came easily from his lecture. 


⁴ The variances were found to be homogeneous by
Bartlett’s test.

⁵ An average intercorrelation of .18 was obtained
between items on the rating sheet, using Peters and
Van Voorhis’ formula (4, pp. 196-200).


Table 4

Analysis of Variance of Ratings on Each Question
for All Treatments Simultaneously

Question    Between Treatments    Within Treatments

2.902
1.052
025
801 132
2.882 4
1.109 132
8.516 4
132
4
132





However, no such simple criteria existed for 
rating his manner, knowledge, poise, or par- 
ticularly his potential. These ratings may 
reflect personal frames of reference and hence 
are more readily influenced by extraneous 
factors such as acceptance or rejection of 
suggestions. Similarly, the factors of pro- 
motability and quality of work, which are 
frequently found on industrial merit rating 
scales, may be especially prone to the influ- 
ence of irrelevant behaviors of the ratee. 


Discussion 


The cathartic effects of expression of criti- 
cism via written messages, noted above, are 
consistent with the findings of Thibaut and 
Coules (4). Their data indicated that per- 
sons who were insulted, and then allowed to 
express their hostility toward the instigator, 
via written notes, later made a greater num- 
ber of friendly remarks about the instigator 
than did other insulted persons who had no 
opportunity to express their hostility. The 
present data suggest that poor impressions, 
like ill feelings, may be altered or reduced, 
by their expression. Low ratings may reflect
a barrier in communications between the
supervisor and his subordinates, rather than
true deficiencies of the ratees. Therefore, a
likely hypothesis is that merit ratings in industry
may be influenced by the degree to
which the rater feels free to criticize or make
suggestions to ratees.

⁶ The dynamics of this phenomenon are described
by Newcomb (3) in his discussion of “autistic
hostility.”

The practical importance of the finding 
that irrelevant characteristics of ratees may 
bias raters’ judgments is difficult to evaluate 
without more knowledge of: (a) the kinds of 
ratee behaviors which act in this way; and 
(b) the amount of bias these behaviors in- 
duce. At any rate, it is clear that amen- 
ability to suggestions induces sufficient bias to 
significantly affect ratings on several factors. 


Summary 


Students in five sections of a general psy- 
chology course listened to a lecture which was 
intentionally delivered in poor fashion. They 
were then asked to rate the lecturer on five 
characteristics, using a seven point scale. Be- 
fore they rated him three of the groups sug- 
gested methods by which the lecturer might 
improve his techniques. One of these groups 
did not submit their suggestions to the lec- 
turer; in another group the lecturer rejected 
the suggestions, while in the third he accepted
them. In two other groups the subjects
did not write suggestions. In no case
did the lecturer actually implement the sug- 
gestions, or improve his delivery. 

The ratings were: (a) consistently most 
favorable in the acceptance group; (b) more 
favorable in the suggestion than the no-sug- 
gestion group; (c) significantly different on 
the characteristics of manner, poise, potential 
and knowledge. 

It has been suggested that poor ratings 
may reflect barriers in communications be- 
tween the rater and the ratee, rather than 
true deficiencies in the ratees. 


Received November 20, 1953.


Recently controversy has arisen over the
methods that can appropriately and meaning- 
fully be used in psychological research on ac- 
cidents. An article by Mintz and Blum (8) 
advocating use of the Poisson distribution and 
analysis of variance for estimating the ex- 
tent of personnel-centered accident liability 
precipitated the discussion. In opposition, 
Maritz (7) argued that “. . . the direct technique
of ‘correlating consecutive periods’ is indispensable.”
More recently, however, Blum
and Mintz (1) and particularly Webb and
Jones (13) have pointed out that the different
techniques are basically the same and
should for all practical purposes yield similar
results.

This paper aims to point out certain short- 
comings of these methods and to propose 
more refined solutions. 


The Poisson Distribution 


A frequency distribution of numbers of ac- 
cidents by individuals can be symmetrical 
only when the mean number of accidents is 
appreciably greater than unity. When there 
are fewer accidents than individuals, the zero 
category must have the greatest frequency, 
and superficially at least the obtained dis- 
tribution must resemble the Poisson distribu- 
tion—often used to estimate the extent to 
which variations in accident histories may be 
attributable to chance factors. Methods of 
computing the theoretical distribution and of 
testing the difference between it and the one
actually obtained are explained elsewhere (2,
8, 9).

¹ This paper was prepared at MacDill Air Force
Base while PHDB was acting as consultant to the
Human Factors Operations Research Laboratories.
The opinions expressed in this paper are those of
the authors and do not necessarily reflect the views
of the Air Force.

Interpretation of the obtained results is, 
however, fairly difficult. First, let us con- 
sider the simpler of the two possibilities— 
that in which the obtained distribution devi- 
ates significantly from a Poisson. Ordinarily 
this result is interpreted as indicating the 
presence in the population of varying degrees 
of accident proneness. Such an interpreta- 
tion is justified only if all persons were ex- 
posed to the same hazards; if this were not 
the case, the significant deviation from the 
Poisson might be reflecting little more than 
differences in exposure. 
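
The comparison of an obtained accident distribution with a Poisson of the same mean can be sketched as follows. The accident counts are hypothetical, the function names are illustrative, and the top category is treated as exactly three accidents when the mean is computed; the references cited in the text give the procedures actually recommended.

```python
import math


def poisson_expected(observed):
    """observed[k] = number of persons with k accidents (last class is open-ended)."""
    n = sum(observed)
    mean = sum(k * f for k, f in enumerate(observed)) / n
    expected = [n * math.exp(-mean) * mean ** k / math.factorial(k)
                for k in range(len(observed))]
    # Fold the upper tail into the last class so the expected counts sum to n.
    expected[-1] = n - sum(expected[:-1])
    return mean, expected


def chi_square(observed, expected):
    """Goodness-of-fit statistic; here df = number of classes - 2."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical records: persons with 0, 1, 2, and 3-or-more accidents.
observed = [60, 30, 8, 2]
mean, expected = poisson_expected(observed)
statistic = chi_square(observed, expected)
print(mean, [round(e, 1) for e in expected], round(statistic, 2))
```

As the text cautions, a large statistic shows only that the counts are not Poisson; it does not separate accident proneness from unequal exposure.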

Second, let us consider the opposite result 
—that in which the obtained distribution does 
not deviate significantly from a Poisson. 
This result is usually interpreted as indicat- 
ing that chance factors may account for the 
obtained variations in accident records and 
that the null hypothesis cannot therefore be 
rejected. Here again, however, the inter- 
pretation is open to question, for representa- 
tion of the data by a Poisson does not elimi- 
nate the possibility of significant correlation, 
either between accident records in successive 
periods or between accident records and logi- 
cally related predictors. Maritz (7) has al- 
ready demonstrated the possibility of ob- 
taining a correlation of .80 between two 
Poisson distributions of accidents in separate 
time periods. 
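
One way to see how two Poisson-distributed accident records can still correlate highly is a shared-component simulation. This is not claimed to be the construction Maritz used; it is a sketch in which each person's count in a period is a persistent Poisson component plus an independent one, so that each period's counts are Poisson with mean 1.0 while the expected correlation between periods is shared / (shared + own) = .80. All names and parameter values are assumptions.

```python
import math
import random


def poisson_draw(lam, rng):
    """One Poisson(lam) draw by Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def pearson_r(xs, ys):
    """Product-moment correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_two_periods(n_persons, shared, own, seed=0):
    """Accident counts for two periods that share a persistent component."""
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(n_persons):
        common = poisson_draw(shared, rng)      # carried over both periods
        first.append(common + poisson_draw(own, rng))
        second.append(common + poisson_draw(own, rng))
    return first, second


first_period, second_period = simulate_two_periods(20000, shared=0.8, own=0.2)
print(round(pearson_r(first_period, second_period), 2))
```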

The coarseness of the measuring unit—the 
fact of an accident—presents further logical 
problems in interpreting the results of a 
Poisson fit. For administrative purposes, 
some arbitrary definition of an accident is 
necessary; however, rigid adherence to the
definition forces into a discrete series behaviors
which really are on a continuum. It
seems perfectly reasonable to assume that 
the same behaviors which might in one in- 
stance result in extensive materiel damage or 
personal injury might in another result in 
merely a “close call.” It therefore seems 
logical to assume also that the persons com- 
prising each of the accident frequency groups 
(zero, single, or multiple) exhibit these be- 
haviors in varying degrees and could, if more 
complete information were available, be fur- 
ther differentiated. 

In short, use of the Poisson distribution is 
at best a preliminary step in the study of 
accident proneness. It serves only to provide 
a quantitative estimate of the stability of 
accident rates. It furnishes no information 
whatsoever on the relationships between spe- 
cific personnel and situational factors on the 
one hand and accidents on the other. 


Correlational Techniques 


Correlational techniques have most fre- 
quently been used in accident research to 
provide a quantitative estimate of the con- 
sistency of the accident behavior of individu- 
als from one period of time to another. In 
attempts to equate exposure in the two pe- 
riods, most investigators have chosen periods 
of odd and even months (4) or odd and even 
days (12). Obtained correlations are typi- 
cally low, indicating little reliability of the 
accident criterion. 

The magnitude of the correlation is a func- 
tion of both the accident rate and the length 
of the exposure period. Because of the 
coarseness of the accident criterion used in 
most studies, accident rates are low and, 
hence, the exposure periods required for ob- 
taining reliable measures are almost prohibi- 
tive in length for most practical purposes. 

Interpretation of the results of studies of 
this sort is difficult because of the lack of 
adequate information on exposure. Even if 
the assumption of equal exposure among in- 
dividuals be granted, the correlation of acci- 
dents in separate time periods provides little 
more than a preliminary estimate of the stability
of the accident criterion. Such an approach
affords no means of identifying persons
most liable to accidents except from
their accident histories. It does not, there- 
fore, provide any basis for predicting which 
members of an independent sample will have 
accidents until after some have experienced 
them. Since our goal is not only to reduce 
but also to prevent accidents, this approach 
obviously is inadequate. 

A more adequate but less widely-used ap- 
plication of correlational techniques in acci- 
dent research is that of correlating individu- 
als’ scores on theoretically-related predictor 
variables (such as intelligence, psycho-motor 
ability, physical condition, etc.) with their 
accident records. In actual practice, however, 
this approach (3) too has typically yielded 
low correlations—again probably because of 
the grossness of the accident criterion. 


Suggested Refinements 


The design of research on personnel factors 
in accidents is admittedly difficult. However, 
we would like to suggest several refinements 
which would make such research more useful. 

The first and most needed refinement is a 
more sensitive criterion measure. The systematic
collection of information on “near-accidents”
and the critical behaviors involved
therein would provide such a criterion (10). 
The Strategic Air Command, USAF, has al- 
ready adopted a policy of collecting these 
data and using them as a basis for both 
remedial and preventive training in how to 
react in emergency situations. Thus, “near- 
accident” data may have practical value, even 
before they accumulate in sufficient quantity 
to provide a reliable criterion against which 
personnel factor variables can be validated. 
The high ratio of near-accidents to accidents 
indicates that the length of the exposure pe- 
riod required for reliable measurement would 
be considerably shorter. 

A second much-needed refinement is better
differentiation between “personal” and “situational”
accidents.² We must discriminate between
those accidents caused largely by situational
factors and those caused primarily by
personnel factors. Theoretically, situational
accidents, such as those caused solely by materiel
failure, occur completely independently
of the personnel involved. If this be the case,
we should not expect to be able to predict
them from a knowledge of personnel factor
variables only. The inclusion of situational
accidents in a study of personnel factors will
lower the obtained correlations and make
them extremely difficult to interpret.

² The comments contained in this and the following
paragraph apply to whatever criterion is used,
be it accidents, near-accidents, or a combination of
both. The term “accidents” alone is used in the
subsequent discussion solely for simplicity of
presentation.

One might legitimately ask if the proposed 
differentiation can successfully be made. The 
recent study of Kubis, Buckley, and Sack- 
man (5) indicates that it can. 

A third requisite for more meaningful 
studies of personnel factors in accidents is the 
systematic collection of more complete infor- 
mation on exposure to hazard. Whenever 
possible, records of the time spent in per- 
formance of the various aspects of a job 
should be maintained, for they provide the 
basis for determining relative hazards. Once 
we have adequate data on the time spent and 
accidents incurred on the several parts of a 
job, we can determine the risk per unit of 
time for each. We can then compute an 
index of exposure for each person which 
weights his experience in the various phases 
of the job by the risk associated with each. 
An index of this sort has recently been de- 
veloped and successfully applied to an Air 
Force population by Warren et al. (11). 
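
The weighting described in this paragraph can be sketched as a risk-per-hour table and a weighted sum. This is an illustration of the general idea only, not the index of Warren et al.; the job phases, hours, accident counts, and function names are all hypothetical.

```python
def risk_per_hour(hours_by_phase, accidents_by_phase):
    """Accidents per hour for each phase of the job, pooled over the group."""
    return {phase: accidents_by_phase.get(phase, 0) / hours
            for phase, hours in hours_by_phase.items() if hours > 0}


def exposure_index(person_hours, risk):
    """One person's hours in each phase, weighted by that phase's risk."""
    return sum(hours * risk.get(phase, 0.0)
               for phase, hours in person_hours.items())


# Hypothetical group totals for three phases of a job.
group_hours = {"takeoff": 1000, "cruise": 8000, "landing": 1000}
group_accidents = {"takeoff": 4, "cruise": 2, "landing": 6}
risk = risk_per_hour(group_hours, group_accidents)

# Hypothetical record of one person's time in each phase.
person = {"takeoff": 10, "cruise": 100, "landing": 10}
print(round(exposure_index(person, risk), 3))  # 0.125
```

The index is in units of expected accidents, so two persons with the same total hours can differ widely in exposure when their time is spent in phases of different risk.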

If such an index be available for all mem- 
bers of our sample, we need not make the 
questionable assumption of equality of ex- 
posure. Instead, we can, to some degree at 
least, remove the effect of differential ex- 
posure from our computations by partial cor- 
relation or other appropriate technique. 
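
As a minimal sketch of the partial correlation technique mentioned here, the first-order partial correlation can be computed from three ordinary correlations. The variable roles and the numerical values below are hypothetical and serve only to show the arithmetic.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation of x and y with z held constant (first-order partial)."""
    denom = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    if denom == 0.0:
        raise ValueError("z is perfectly correlated with x or y")
    return (r_xy - r_xz * r_yz) / denom


# Hypothetical values: x = predictor score, y = accident record,
# z = exposure index.
r_partial = partial_correlation(r_xy=0.30, r_xz=0.20, r_yz=0.50)
```

In this made-up case the predictor-accident correlation drops from .30 to roughly .24 once exposure is held constant.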

A fourth proposed refinement is the com- 
putation of correlations between individuals’ 
scores on certain theoretically-related pre- 
dictor variables and their records of accidents 
in which the measured traits are thought to 
be important. For example, we would hy- 
pothesize that psycho-motor test scores should 
predict only those accidents in which psycho- 
motor deficiencies are a primary contributing 
factor. Testing the significance of the obtained
correlation coefficient would then enable us to
confirm or refute our hypothesis.
On the other hand, correlation of scores on 
such tests with over-all accident records would 
be likely to obscure any relationships which 
might exist. 

In groups where the number of accidents is 
small and the members are engaged in di- 
verse tasks for which the degree of hazard is 
difficult to estimate, correlational studies of
the type just outlined are not likely to prove 
worthwhile. Consequently, a different at- 
tack must be made on the accident problem. 
The one recommended is detailed situational 
analysis, accomplished by experienced job 
analysts or safety engineers or both and fol- 
lowed by administrative actions designed to 
overcome identified hazards. As a matter of 
fact, a thorough situational analysis (6) is 
indispensable even in those instances (such 
as in the military and in large industrial
organizations) in which the correlational
approach is applicable, for the contribution of
personnel factors to existing accident rates 
is usually considerably less than that of situa- 
tional variables. 

Until those factors, either within individu- 
als or within situations, related to accidents 
are clearly identified and remedial actions in- 
stituted, much research remains to be done. 
As yet, our research efforts have not ap- 
proached this point. 


Received December 21, 1953.

This study was undertaken in the hope of
contributing to a clarification of the evidence 
on which the widely accepted theory of in- 
dividual differences in accident liability is 
based. This evidence is incomplete. One of 
the principal facts included in it is the fre- 
quently-good fit of obtained accident dis- 
tributions to the so-called unequal liabilities 
distribution¹ derived by Greenwood and Yule
(4). This derivation includes the assumption 
of constancy of individual liability to acci- 
dents, i.e., the notion that accident liability 
of particular persons does not change with 
the passage of time or with the occurrence of 
accidents. 
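
For reference, the unequal liabilities distribution can be written out as follows. This is the standard textbook form of the derivation, given here as a sketch; the symbols \lambda, r, and \beta are notation introduced for illustration and do not come from the paper.

```latex
% Each person's number of accidents X in the period is Poisson with a
% personal rate \lambda that stays constant over time:
P(X = k \mid \lambda) = \frac{e^{-\lambda}\lambda^{k}}{k!}, \qquad k = 0, 1, 2, \dots

% Liabilities differ between persons according to a gamma distribution
% with shape r and rate \beta:
f(\lambda) = \frac{\beta^{r}}{\Gamma(r)}\,\lambda^{r-1}e^{-\beta\lambda}, \qquad \lambda > 0

% Averaging over persons gives the negative binomial distribution:
P(X = k) = \int_{0}^{\infty} P(X = k \mid \lambda)\, f(\lambda)\, d\lambda
         = \frac{\Gamma(r + k)}{k!\,\Gamma(r)}
           \left(\frac{\beta}{1 + \beta}\right)^{r}
           \left(\frac{1}{1 + \beta}\right)^{k}

% Its variance exceeds its mean, unlike the Poisson distribution:
E(X) = \frac{r}{\beta}, \qquad
\operatorname{Var}(X) = \frac{r}{\beta} + \frac{r}{\beta^{2}}
                      = E(X)\left(1 + \frac{1}{\beta}\right)
```

The constancy assumption enters in the first line: each person's rate \lambda is taken to be the same before and after any accident.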

However, there are theoretical considera- 
tions which suggest that such an assumption 
is not likely to be exact. An accident can be 
expected to function at times as a traumatic 
experience and to disrupt subsequent behav- 
ior. It can also be expected to function as a 
punishment and, as such, to have one or an- 
other effect on the learning of the individual. 
There are some accident distributions which 
do not fit theoretical distributions based on 
assumptions of the constancy of accident lia- 
bility (3). Very likely, in such cases acci- 
dent proneness is affected by the accidents. 

Even in the cases in which theoretical dis- 
tributions based on the assumption of con- 
stant accident proneness do fit the data, the 
possibility of inconstant accident proneness 
cannot be excluded. It has been shown, e.g., 
by Irwin (6), that the same distribution 
which Greenwood and Yule derived in part 
from the assumption that accident liability 
varies from one individual to another but is 
constant for each individual, can also be de- 
rived from the assumption that there are no 
initial variations in accident liability, but that 
instead each accident increases the proneness 
of the individual by a constant amount. It 
can undoubtedly also be derived from an as- 
sumption of large initial differences in acci- 
dent liability and decrease of accident lia- 
bility with the occurrence of accidents. 


1 Or negative binomial distribution. 


Thus only tentative inferences may be drawn 
about the probable underlying distribution of 
accident liability from an obtained set of ac- 
cident records, unless something is known 
about whether and how accidents occurring 
to people affect their accident liability. Not 
even the existence of initial differences in ac- 
cident liability in the group may be inferred 
with certainty without such knowledge. In 
the absence of such knowledge, the assump- 
tion of unchanged liability after accidents 
was generally either implied (1) or explicitly 
made (10, 11) by workers in this field. 

The factual evidence on the validity of the 
assumption of accident liability as unchanged 
by accidents is very scanty. Irwin has com- 
mented upon a few results on accident rates 
of groups of people in consecutive periods 
which were opposed to his hypothesis of acci- 
dent proneness increasing with accidents. The 
accident rates did not tend to increase; the 
changes were slight, and, if anything, the 
rates tended to drop. A rather similar find- 
ing has been discussed by Kerrich (8). On 
the other hand, Horn has presented material 
on time-intervals between airplane accidents 
which suggested to him that accident suscepti- 
bility is temporarily increased by accidents. 
He recommended adjustment techniques for 
pilots following accidents. Thus the question 
may be of great practical importance to those 
interested in preventing accidents. 


The Problem 


It was thought that further research on 
time-intervals between accidents was desir- 
able. Increasing accident proneness should 
show itself as a trend toward decreased time- 
intervals between accidents, while decreasing 
accident proneness should show the opposite 
trend. The problem of this paper was to dis- 
cover whether trends towards such changes 
of time intervals do or do not occur. 

However, there are methodological difficulties
in such research. Thus it is not immediately
obvious how the interval before the first
accident and the interval after the last
accident during the arbitrary observation period
should be treated in comparison with the
time intervals between accidents. This paper 
reports an attempt to deal with one set of 
data on time-intervals between accidents. It 
is hoped to provide examples of the type of 
information which can be obtained from the 
study of time-intervals, and to present some 
material relevant to the methodology in this 
field. The material is examined chiefly in 
relation to two possible theories: first, that
accident proneness is constant for each
individual; and second, that proneness is
increased by accidents.


The Data 


The data examined were accident records 
of 178 taxi drivers, made available by Dr. 
E. Ghiselli, whose cooperation is appreciated. 
The period covered was one year. For each 
driver, the weeks in which accidents occurred 
were indicated. All drivers had worked for 
the company at least a year prior to the be- 
ginning of the observation period. Six drivers 
who resigned from their jobs during the ob- 
servation period, or who were absent from 
work for eight or more weeks, were elimi- 
nated. Thus records of 172 drivers were in- 
cluded in this study. 


The Mathematical Background 


In order to discover what the time intervals 
before, between, and after the accidents
indicate about the possible effects of accidents on
accident liability, it is essential to compare 
them to the statistical expectancies based on 
the assumption that accidents are distributed 
over a time period completely at random. 
The hypothesis of random distribution of 
points within an interval has been previously 
studied by Whitworth (13), Greenwood (2), 
Moran (12), and Maguire, Pearson and 
Wynn (9). It assumes that each accident is 
independent of all other accidents and that 
its occurrence is equally probable at all times 
during the period. However, this is assumed 
only if each accident is viewed as a separate 
entity, defined in terms of what happens 
(e.g., sideswiping a particular telephone pole) 
rather than in terms of the position of the
accidents relative to each other. Sideswiping
the telephone pole is more likely to be the
first accident if it happens early in the
observation period than if it happens late; and
so with other types of accidents. The prob- 
ability that a particular accident is the first 
one during the observation period is propor- 
tionate to the probability that no other acci- 
dent has yet taken place. This probability 
decreases with the passage of time. 

If n accidents have happened to an individual
during a time interval of unit duration,
the probability of the first accident happening
at time x decreases in proportion to
$(1 - x)^{n-1}$ as x increases. The probability
function of the first accident within a total
time interval of unit length is given by the
expression $n(1 - x)^{n-1}$. The probability of
the second accident at time x requires first,
that one accident must have taken place already,
and second, that no other accident shall have
happened yet. Its formula is
$n(n - 1)x(1 - x)^{n-2}$, and similar expressions
may be derived for the probabilities of
the times of the other accidents.

Probability functions can generally be used 
for the computation of theoretical means and 
standard deviations. Such computations indi- 
cate that, in terms of the null hypothesis, sta- 
tistical expectancies for the mean time-inter- 
vals from the beginning of the observation pe- 
riod to the first accident; from the first to the 
second; and so on, including the time-inter- 
val between the last accident and the end of 
the observation period are the same. Simi- 
larly, the expectancies of the variances of the 
time-intervals are also identical. In studying 
the possible effects of accidents on accident 
liability, the periods before the first accident 
and after the last accident may be treated in 
the same manner as the intervals between 
accidents. 
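
The expectancies referred to in this section follow from standard results on uniformly distributed points. The sketch below states them in general form; the symbols X_{(k)}, D_k, and T are notation introduced here for illustration, and the null hypothesis is taken to mean that the n accident times are independent and uniform over the period.

```latex
% Density of the time of the k-th accident among n accidents that are
% independently and uniformly distributed over an interval of unit length:
f_{X_{(k)}}(x) = \frac{n!}{(k-1)!\,(n-k)!}\; x^{k-1}(1 - x)^{n-k},
\qquad 0 \le x \le 1

% k = 1 gives n(1 - x)^{n-1}; k = 2 gives n(n-1)\,x\,(1 - x)^{n-2}.

% The n + 1 intervals D_1 = X_{(1)}, D_k = X_{(k)} - X_{(k-1)}, and
% D_{n+1} = 1 - X_{(n)} all have the same distribution, so that
E(D_k) = \frac{1}{n + 1}, \qquad
\operatorname{Var}(D_k) = \frac{n}{(n + 1)^{2}(n + 2)}

% For an observation period of T weeks the mean is multiplied by T and
% the variance by T^{2}; with T = 52 the expected interval is
% 52/2 = 26 weeks for n = 1 and 52/3 = 17.33 weeks for n = 2.
```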


Results 


Of the 172 drivers included in the com- 
putation, 60 had no recorded accidents and 
112 drivers had one or more accidents, rang- 
ing up to 25. The accident distribution was 
very different from the theoretical distribu- 
tion which results from the assumptions of 
equal and constant accident liability. In an 
equal liability or so-called Poissonian dis- 
tribution, the variance of accidents is equal 
to the mean. In the Ghiselli data, it is about 





Time Intervals between Accidents 


six times as large as the mean. The distribu- 
tion seems to be capable of being explained 
in terms of the hypothesis of large stable dif- 
ferences in accident liability. On the other 
hand, in accordance with the considerations 
mentioned earlier, it also can be explained in 
terms of other assumptions, e.g., that of linear 
increase of accident liability with accidents. 
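
The comparison of variance and mean can be retraced from the group sizes reported for this sample. The sketch below takes the 60 accident-free drivers from the text and the remaining counts from the group sizes in Table 1 as best they can be read from the source; it uses the variance with divisor N, which is an assumption, since the paper does not say which formula was applied.

```python
# Number of drivers by number of accidents during the year.
drivers_by_accidents = {
    0: 60, 1: 45, 2: 16, 3: 13, 4: 13, 5: 4, 6: 5, 7: 3, 8: 2, 9: 3,
    11: 2, 12: 1, 13: 1, 15: 1, 16: 1, 18: 1, 25: 1,
}

n_drivers = sum(drivers_by_accidents.values())
mean = sum(k * n for k, n in drivers_by_accidents.items()) / n_drivers
variance = sum(
    n * (k - mean) ** 2 for k, n in drivers_by_accidents.items()
) / n_drivers

# For a Poisson (equal liability) distribution this ratio is expected
# to be close to 1.
dispersion = variance / mean
```

With these counts the ratio comes out somewhat under six, in line with the statement that the variance is about six times the mean.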

What do the time-intervals indicate? They 
were first examined separately for groups of 
drivers with different accident records. 

A total of 45 drivers had one accident each. 
The theoretical expectancy for the position of 
the mean time of a single accident is 26 
weeks. The obtained mean was 21.9 weeks. 
The critical ratio was 1.99,² which is significant
at the .05 level. It should be noted
that this suggestive difference is in the op- 
posite direction from the one which would be 
expected if accident liability increased with 
accidents. The question was not investigated 
whether this result was due to the fact that 
accident liability decreased with the first ac- 
cident in this group, or whether it was pro- 
duced by seasonal fluctuations. 

In the two-accident case, the situation was
somewhat similar. The mean durations of the
three time-intervals (up to the first accident,
between the first and second accidents, and
from the second accident until the end of the
observation period) were 13.1, 15.6, and 23.2
weeks, respectively. The theoretical expectancy
is 17.33 weeks, with a standard error of 3.06
weeks. The differences between the time-intervals
are again suggestive of a decrease in accident
liability after accidents, but the result does
not seem to be statistically significant.
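
The theoretical expectancy and standard error quoted for the two-accident group can be retraced with a short calculation. This is a sketch under the assumptions that accident times are uniform over a 52-week period and that drivers are independent; the function names are illustrative, and treating each interval separately in a critical ratio is an assumption, since the paper does not spell out its test.

```python
import math

WEEKS = 52.0  # length of the observation period


def expected_interval(n_accidents, period=WEEKS):
    """Expected length of any one of the n + 1 intervals."""
    return period / (n_accidents + 1)


def interval_sd(n_accidents, period=WEEKS):
    """Theoretical standard deviation of a single interval."""
    n = n_accidents
    return period * math.sqrt(n / ((n + 1) ** 2 * (n + 2)))


def standard_error(n_accidents, n_drivers, period=WEEKS):
    """Standard error of a mean interval over independent drivers."""
    return interval_sd(n_accidents, period) / math.sqrt(n_drivers)


def critical_ratio(observed_mean, n_accidents, n_drivers, period=WEEKS):
    """Deviation of an observed mean interval in standard-error units."""
    expected = expected_interval(n_accidents, period)
    return (observed_mean - expected) / standard_error(n_accidents, n_drivers, period)


# Two-accident group: 16 drivers, observed means taken from the text.
expectancy = expected_interval(2)
se = standard_error(2, 16)
ratios = [critical_ratio(m, 2, 16) for m in (13.1, 15.6, 23.2)]
```

Under these assumptions the expectancy rounds to 17.33 weeks and the standard error to 3.06 weeks, and none of the three ratios reaches 1.96 in absolute value.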

The situation was somewhat different in 
the cases which had three, four, and five acci- 
dents. Here the first and last time-intervals 
were longer than the time-intervals between 
accidents, but this finding was again of doubt- 
ful statistical significance. Groups with more 
than five accidents were too small for de- 
tailed presentation. 

The significant fact which emerges from the
examination of the mean time-intervals of
groups of drivers, classified on the basis of
number of accidents, is that there was no
consistent trend toward a decrease of
time-intervals with repeated accidents. Table 1
presents the data.

2 The critical ratio rather than the t-ratio was
used because a theoretical standard deviation
could be and was utilized.

In the preceding discussion, separate com- 
parisons of time intervals were made within 
each group of drivers with a particular num- 
ber of accidents. These groups were for the 
most part very small. Therefore the data 
were also treated in another way, in terms of 
cumulative groups. For all drivers who had 
one or more accidents, the mean time inter- 
val before their first (or only) accident was 
ascertained. The mean time interval before 
the second accident was ascertained for all 
drivers with two or more accidents, and for 
the same group the mean time interval before
the first accident was also computed.
Similarly, the mean time intervals
before the third accident and before the first 
accident were determined for the group with 
three or more accidents, and so on. 
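
The cumulative grouping described in this paragraph can be sketched as follows. The accident times are hypothetical, the function names are illustrative, and the difference is taken as the later interval minus the first interval, which is an assumption about the sign convention.

```python
def intervals(accident_weeks):
    """Intervals before the 1st accident and between consecutive accidents.

    The interval after the last accident is deliberately left out.
    """
    times = sorted(accident_weeks)
    previous = [0.0] + times[:-1]
    return [t - p for t, p in zip(times, previous)]


def cumulative_row(drivers, k):
    """Summary for all drivers who had at least k accidents (k >= 1).

    Returns (number of drivers, mean interval before the k-th accident,
    mean interval before the first accident of the same drivers,
    difference), or None when no driver qualifies.
    """
    selected = [intervals(d) for d in drivers if len(d) >= k]
    if not selected:
        return None
    mean_kth = sum(iv[k - 1] for iv in selected) / len(selected)
    mean_first = sum(iv[0] for iv in selected) / len(selected)
    return len(selected), mean_kth, mean_first, mean_kth - mean_first


# Hypothetical accident times, in weeks from the start of the period.
drivers = [
    [20.5],
    [10.5, 30.5],
    [4.5, 8.5, 20.5],
    [],
]
row1 = cumulative_row(drivers, 1)
row2 = cumulative_row(drivers, 2)
row3 = cumulative_row(drivers, 3)
```

Because each driver appears in every row up to his own number of accidents, the rows overlap, which is why the successive figures are not independent.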

The results of these computations are pre- 
sented in Table 2, the first column of figures 
giving the mean times between the consecu- 
tive accidents and the second column giving 
the mean times before the first accidents of 
the same people, and the third column pre- 
senting the differences. It should be noted 
that according to both hypotheses considered 
in this paper, the figures in the first column 
should tend to decrease as one proceeds down 
the table. 

According to the theory of individual dif- 
ferences in accident proneness, the same de- 
crease is expected in the second column; this 
is to be expected because the bottom of the 
table deals with drivers who had repeated 
accidents, because repeated accidents are apt 
to be indicative of high accident proneness, 
and because high accident proneness is apt to 
result in short time intervals both before the 
first accident and between the later accidents.
According to the theory of increased proneness
following accidents, the decrease should be much
more pronounced in the first column than in the
second one,³ and the differences in the third
column should tend to be negative and to
increase in absolute amount. The increase of the
magnitude of the differences should occur
because the accidents intervening between the
first accident and the later ones are assumed to
increase their accident proneness. This factor
would be assumed to produce a marked decrease in
the figures of the first column; it would be
assumed to be lacking in the case of the figures
in the second column, in which only a weaker
downward trend would be expected.

There are a number of statistical procedures
by means of which the agreement of the two
hypotheses with the data could be tested.
However, their presentation would have required
much space, mainly because of two difficulties:
the groups of drivers overlap, so that the
figures in the second column are not
independent, and the theoretical distributions
of the time intervals are not normal. It was
not thought that the expected gain from the
treatment of the data embodying these
considerations was likely to justify the added
space. Therefore the material is treated in
terms of a simple inspection of the table.

The expected tendency towards decreasing
time intervals between the higher numbered
accidents is present. As the incidence of
accidents rises, the time interval before the
first accident also tends to grow shorter, to
about the same extent. As one reads down the
table, there is no tendency toward larger
negative values of differences. There are
some fluctuations in the values of these
differences, but these fluctuations are not
large, do not suggest an intelligible pattern,
and according to tentative computations do not
seem to be statistically significant.

3 There are two reasons for expecting some
downward trend in these figures according to the
behavior disruption theory. First, there are
selective factors: people who had the first
accident early “by chance” have more time left
in which they may have additional accidents;
second, the drivers in this study were not new
and may have developed differences in accident
proneness as a result of accidents occurring
before the observation period.


Table 1
Mean Times* Before the First Accident, Between Accidents, and After the Last Accident

One-accident group (n = 45): 21.9; 30.1
Two-accident group (n = 16): 13.1; 15.7; 23.2
Three-accident group (n = 13): 16.1; 9.6; 12.9; 13.4
Four-accident group (n = 13): 14.0; 8.6; 5.5; 5.6; 18.3
Five-accident group (n = 4): 9.7; 6.0; 5.5; 7.8; 8.2; 14.8
Six-accident group (n = 5): 5.9; 4.8; 6.8; 9.2; 10.6; 10.0; 4.7
Seven-accident group (n = 3): 2.5; 10.0; 5.3; 5.7; 9.0; 7.0; 6.0; 6.5
Eight-accident group (n = 2): 1.0; 2.0; 6.5; 7.5; 5.5; 2.0; 2.5; 11.5; 13.5
Nine-accident group (n = 3): 2.2; 2.7; 9.7; 6.7; 3.0; 3.3; 1.3; 8.0; 5.0; 10.2
Eleven-accident group (n = 2): 8.5; 2.5; 1.5; 2.5; 1.5; 3.5; 6.0; 4.5; 8.5; 2.5; 6.0; 5.5
Twelve-accident group (n = 1): 2.5; 1; 6; 4; 6; 2; 1; 2; 6; 1; 2; 4; 14.5
Thirteen-accident group (n = 1): 1.5; 1; 5; 1; 2; 1; 1; 8; 2; 13; 1; 11; 4; 0.5
Fifteen-accident group (n = 1): 1.5; 1; 1; 3; 3; 2; 3; 6; 5; 1; 4; 5; 3; 10; 2; 1.5
Sixteen-accident group (n = 1): 2:55 7s 2s 2s 05 1; S¢:6:'33.13:5; 25.6;:2; 2-33 35
Eighteen-accident group (n = 1): 0.5; 2; 1; 12; 1; 1; 3; 1; 1; 1; 1; 1; 5; 1; 2; 1; 3; 10; 4.5
Twenty-five-accident group (n = 1): 3.5; 1; 1; 1; 2; 1; 1; 5; 2; 1; 1; 1; 6; 1; 6; 1; 1; 1; 7; 1; 1; 3; 1; 1; 1; 0.5

* In computing the mean times it was assumed that the accidents occurred in the middle of the week.


Discussion 


These results are clearly not in favor of 
the hypothesis of increased accident suscepti- 
bility with accidents. For this set of data the 
theory of proneness, varying from person to
person and reasonably constant for each person,
appears to be more appropriate.

This conclusion requires qualifications. It 
should be noted that certain factors were not 
taken into consideration in this study. The 
possibility of seasonal fluctuations in accident 
rates was one such factor. Another factor not 
considered had to do with the different dis- 
tances driven by the different drivers. Both 
of these factors are likely to have functioned 
as sources of variation in the accident rates, 
and taking them into account should have 
given a somewhat better test of the hypothe- 
sis of constant accident proneness of indi- 
viduals. 







Table 2

Comparison of Mean Time Intervals in Weeks Before First Accident and Before Later Accidents

                                                   Mean Time Interval
                                      Mean Time    of the Same Drivers
Interval (drivers)                    Interval     Before First Accident   Difference

Before 1st accident (112 drivers)       15.1
Between 1st & 2nd (67 drivers)           8.9
Between 2nd & 3rd (51 drivers)
Between 3rd & 4th (38 drivers)
Between 4th & 5th (25 drivers)
Between 5th & 6th (21 drivers)
Between 6th & 7th (16 drivers)
Between 7th & 8th (13 drivers)
Between 8th & 9th (11 drivers)
Between 9th & 10th (8 drivers)       [illegible]
Between 10th & 11th (8 drivers)          3.2
Between 11th & 12th (6 drivers)          4.0
Between 12th & 13th (5 drivers)          4.8
Between 13th & 14th (4 drivers)          3.2
Between 14th & 15th (4 drivers)          3.0
Between 15th & 16th (3 drivers)          1.7
Between 16th & 17th (2 drivers)          2.0             2.0                  0.0
Between 17th & 18th (2 drivers)          5.5             2.0                  3.5
Between 18th & 19th, 19th & 20th,     7, 1, 1, 3,        3.5          3.5, -2.5, -2.5, -0.5,
  etc. (1 driver)                      1, 1, 1                        -2.5, -2.5, -2.5

However, this hypothesis is probably only
an approximation which is not applicable to
all individuals and groups. The records of a
few drivers suggest temporary fluctuations of
accident proneness with some individuals.
Temporary increases in accident proneness
may well be due to periods of emotional
stress.

However, there is need to investigate the 
statistical significance of such apparent fluc- 
tuations in accident proneness or liability. 
Maguire, Pearson and Wynn (9), Green- 
wood (2), and Irwin (7) have pointed out 
certain difficulties in determining the sta- 
tistical significance of departures of sequences 
of time intervals from randomness, and it is 
not entirely clear to this writer whether the 
problem has been solved. 

The apparent lack of systematic effects of 
accidents on accident rates found in this 
study need not hold for all groups. It may 
have partly resulted from the fact that all 
drivers had worked for the company at least 
a year before the observation period. Con- 
siderations based on the psychology of learn- 
ing suggest that accident proneness might be 
less constant with inexperienced workers. 
Research on time-intervals between accidents 
for inexperienced workers is worth attempt- 
ing. 

Our finding is not in agreement with Horn’s 
conclusion that accident susceptibility is 
temporarily increased by accidents. This 
disagreement with Horn’s conclusion may 
represent a difference between different kinds 
of accidents, since his data dealt with air- 
plane accidents and ours with accidents to 
taxi-drivers. On the other hand, the discrep- 
ancy may be due to different statistical treat- 
ments. Possibly a statistical artifact was 
involved in his conclusions. Horn’s tables 
showed a relative preponderance of short time 
intervals over longer ones between consecutive 
accidents. However, he was apparently not 
aware of the nature of the distribution of the 
time intervals between events distributed at 
random within a period of time, and his tables 
do not indicate whether or not his results 
differed from chance expectancy. The matter 
calls for further investigation. 


Summary 


Much of the evidence in favor of the commonly
accepted hypothesis of individual differences
in accident proneness is valid only if one
assumes that accident proneness of individuals
is not affected by accidents in which they are
involved. The validity of this assumption is
investigated in terms of a study of time
intervals between consecutive accidents of a
number of taxi-drivers. Some features of the
relevant mathematical theory of the random
distribution of events in time are reviewed.
The findings pertaining to the time intervals
between accidents suggest that, for the group
studied, the customary assumption of unchanged
accident proneness following accidents is
approximately true.


Received December 28, 1953.

Students enrolled in a School of Business
Administration generally plan to establish 
careers in business organizations, and aspire 
to become business administrators. Although 
considerable effort is expended in relating 
present student interests to future job suc- 
cess, the relationship of attitudes on signifi- 
cant social issues and future job success is 
apparently largely ignored. The general pur- 
pose of this study is to explore the possibility 
that student attitudes on social issues may 
serve as useful measures for prediction of suc- 
cess in various fields of endeavor. The spe- 
cific purpose of this study is to measure the 
extent to which attitudes of business adminis- 
trators differ from those of students in a 
School of Business Administration. 


Procedure 


Members of a seminar distributed question- 
naires containing 40 statements to 78 business 
administrators and to 146 business school stu- 
dents. Respondents were forced to reply either 
yes or no to each statement. Five statements 
dealt with unionism, 10 with government con- 
trol, 15 with personnel policy, 5 with profit dis- 
tribution, 4 with the free enterprise system, and 
1 with the desirability of business training on 
the college level. 

Business administrators were selected on the 
basis of the size of their establishments and their 
willingness to cooperate. The largest organiza- 
tion in the general vicinity of the seminar mem- 
ber’s home was contacted first. When the top 
administrative officer was unavailable, the ques- 
tionnaire was completed by an individual who 
was second in command. About 80% of the 
firms contacted employed fewer than 25 persons
and all of the firms were located in Mississippi. 

The student sample was composed entirely of 
Mississippi State College students who were en- 
rolled in the School of Business and Industry. 
Approximately 60% of the students were upper- 
classmen. As in the case of business administra- 
tors, the student sample is composed entirely of 
students who were willing to cooperate with the 
survey.


Results


Significant or very significant differences 
between responses of business administrators 
and students of business administration were 
found on 20 of the 40 statements contained 
in the questionnaire. Specific details are 
found in Table 1. The statements are num- 
bered in accordance with their appearance in 
the questionnaire. The significance of the 
difference between percentages was estimated 
from the Lawshe and Baker nomograph.¹
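
The nomograph itself cannot be reproduced here. As a rough stand-in, and plainly a different technique from the one the authors used, the ordinary large-sample z test for the difference between two independent proportions illustrates the kind of comparison involved. The sketch assumes that all 146 students and all 78 administrators answered the item; the function name is illustrative.

```python
import math


def two_proportion_z(p1, n1, p2, n2):
    """z statistic for the difference between two independent proportions,
    using the pooled estimate of the common proportion."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    return (p1 - p2) / se


# Item 39 in Table 1: 28 per cent of the students and 64 per cent of the
# administrators replied yes.
z_item_39 = two_proportion_z(0.28, 146, 0.64, 78)
```

An absolute z above 1.96 corresponds to the 5% level and above 2.58 to the 1% level; for this item the value is well beyond both.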

It is noteworthy that significant differences 
between responses of administrators and stu- 
dents were found on every government con- 
trol statement which appeared in the ques- 
tionnaire. In all cases the students professed 
to be more favorably disposed toward gov- 
ernment regulation and control than the ad- 
ministrators. 


Discussion 


Although a marked divergence of attitudes 
is indicated on one-half of the issues pre- 
sented to both students and the administra- 
tors, the effect of this situation upon the 
future success of the students in their roles 
as administrators is not clearly discernible. 
Some shifting of attitudes may take place 
when students are placed on the job. There 
is also the possibility that some shifting of 
attitudes may occur on the part of adminis- 
trators. In the event that student attitudes 
toward government control will not subject 
the students to unfavorable discrimination by 
present business administrators, it may be 
reasonable to believe that the forthcoming 
generation of administrators will be less prone 
to believe that the whole American economy 
will collapse with the further extension of gov- 
ernmental influence in business affairs. 

1 Lawshe, C. H. and Baker, P. C. Three aids in
the evaluation of the significance between
percentages. Educ. psychol. Measmt., 1950, 10, 263-270.


Table 1

Distribution of Responses to Items in Questionnaire

                                                                  Per Cent Replying Yes
Item                                                          Student   Administrator   Diff.

 1. [illegible] should receive government [illegible]            34          19         15*
 3. Corporations should be taxed higher than individuals         80          63         [illegible]
 4. Price control will destroy the free enterprise system        35          60         25[?]
 7. Workers will do their best work only if strict
    discipline is maintained [illegible]                         30          46         16*
 9. Employees should have company sponsored retirement
    plans                                                        84          72         12*
11. More jobs should be covered by the minimum wage law          76          40         36[?]
13. Labor unions help [illegible]                                75          60         17[?]
15. A worker who is “no good” on one job will probably be
    “no good” on [illegible]                                     10          24         14**
16. The federal government should subsidize educational
    institutions                                                 58          40         18**
23. A Fair Employment Practices Commission should be
    established in [illegible]                                   35          33         18**
30. Shifty-eyed persons are dishonest                        [illegible]     18         [illegible]
31. Unemployment benefits should be abolished                [illegible]     36         [illegible]
32. People with red hair are emotionally unstable            [illegible]     13          8*
33. You can tell a person’s intelligence by interviewing
    him                                                          26          42         16**
35. Government old age pensions should be abolished          [illegible]     23         10*
36. There should be absolutely no government [illegible]
    or regulation of pri[illegible]                              29          45         16*
37. Labor unions will destroy the free enterprise system         24          40         16[?]
38. Government should compete with private business
    whenever the public [illegible]                              69          54         15[?]
39. Profits resulting from increased productivity should
    be divided equally among stockholders, labor, and
    the consumers                                                28          64         36**
40. You cannot have democracy without the free enterprise
    system                                                       73          53         20**

* Indicates 5% level of confidence.
** Indicates 1% level of confidence.


Summary 


An attitude survey blank containing 40
statements and covering the areas of govern- 
ment control, personnel policy, profit distribu- 
tion, labor unionism, and the free enterprise 
system was completed by 78 business ad- 
ministrators and 146 business administration 
students. 

1. Significant differences between the responses
of the two groups were found on 20 of the 40
items contained in the questionnaire.


2. Disagreement was greatest in the area 
of government control. Students were sig- 
nificantly more favorably disposed to gov- 
ernment control than the administrators. 

3. Despite the marked divergence between 
attitudes of the two groups, the possibility 
exists that some student attitudes may shift 
toward those professed by administrators 
when the students are forced to solve prob- 
lems presently faced by the administrators. 


Received November 25, 1953.

Student achievement in college courses
where multiple measurements are taken over 
the semester involves two aspects: consistency 
and variability. Inter-student differences in 
the average level of course achievement over 
several examinations are used as the basis for 
assigning course grades and result in moder- 
ately reliable achievement indices (1). How- 
ever, the reliability of average achievement 
levels is less than perfect due to intra-student 
fluctuation in performance from test to test. 
Glaser (4) has shown that measures of incon- 
sistent responses in a retest situation have a 
moderate, but significant degree of reliability 
which suggests that measures of achievement 
fluctuation within a course may reliably meas- 
ure an important aspect of student achieve- 
ment performance. 

We are suggesting that achievement fluc- 
tuation is particularly important in attempt- 
ing to predict achievement level. The usual 
educational procedure of averaging scores or 
grades on several course tests to arrive at a 
final course grade implies that measures of 
achievement level and of achievement fluctua- 
tion will be related in a nonlinear manner. 
Students receiving A course grades are most 
likely to have shown consistent A or B per- 
formance on each test and those receiving F 
grades to have achieved at D or F levels on 
each test. However, the C or middle group 
includes both students who have received 
consistent C grades on each test and also 
those who have fluctuated widely from A to 
F on separate examinations, but whose aver- 
age grade ends up as a C. We would predict 
a curvilinear relationship between measures 
of achievement level and achievement fluctua- 
tion, with the middle achievement level groups 
showing the largest average fluctuation and 
the extreme level groups (high and low) 
demonstrating significantly smaller fluctua- 
tion. 


If this analysis is correct it then bears im- 
portantly on the problem of predicting stu- 
dent achievement in the first course in psy- 
chology. Aptitude test scores have shown 
moderate, but important rectilinear relation- 
ships with achievement level (2, 9, 10), but 
attempts to use personality and interest
scales to predict level when aptitude is sta- 
tistically held constant have been fruitless 
(7, 8). This lack of success may be due to:
(a) curvilinear relationships existing between 
such scales and achievement level which are 
not revealed by rectilinear correlation tech- 
niques; or (b) these personality and interest 
scales being predictive of only the same type 
of student behavior that is predicted by apti- 
tude tests. The first of these hypotheses 
could be tested by computing curvilinear cor- 
relation coefficients (eta or epsilon) in addi- 
tion to rectilinear Pearsonian coefficients and 
applying standard tests of significance to the 
difference between the pairs of rectilinear and 
curvilinear coefficients. Hypothesis (b) could 
be assessed by correlating the personality 
and/or interest scales with aptitude tests 
known to be related to the achievement cri- 
terion and finding the partial correlation of 
scales and achievement with aptitude sta- 
tistically held constant. However, our as- 
sumed relationship between achievement level 
and fluctuation suggests a third hypothesis:
(c) a predictor may be rectilinearly related 
to both level and fluctuations, but because of 
the curvilinear confounding of level and fluc- 
tuation may show a zero correlation with 
level. This third hypothesis could be tested 
by correlating each scale with measures of 
both level and fluctuation and tempering our 
judgments of nonsignificant scale-level cor- 
relations in the light of obtained scale-fluctua- 
tion relationships. 
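
The contrast between rectilinear and curvilinear coefficients can be made concrete with a small sketch. The Python code below (standard library only) computes the Pearson r and the correlation ratio eta for a set of scores grouped into five categories. The function names and the data are hypothetical and serve only to show that a relationship shaped like an inverted U can give an r of zero together with a large eta.

```python
import math
from collections import defaultdict

def pearson_r(x, y):
    """Product-moment (rectilinear) correlation between two lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_ratio(groups, y):
    """Eta of y on a categorical grouping: sqrt(SS_between / SS_total)."""
    by_group = defaultdict(list)
    for g, v in zip(groups, y):
        by_group[g].append(v)
    grand = sum(y) / len(y)
    ss_total = sum((v - grand) ** 2 for v in y)
    ss_between = sum(
        len(vs) * (sum(vs) / len(vs) - grand) ** 2 for vs in by_group.values()
    )
    return math.sqrt(ss_between / ss_total)

# Hypothetical data: five ordered groups (weights 0 to 4), two cases each,
# with the middle group scoring highest.
weights = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 3.0, 4.0, 1.0, 2.0]

r = pearson_r(weights, scores)            # essentially zero
eta = correlation_ratio(weights, scores)  # close to .95
```

Because eta can never be smaller than the absolute value of r computed on the same grouping, the question of interest is whether the excess of eta over r is larger than chance would produce.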

Several recent studies have suggested that 
the Guilford Zimmerman Temperament Survey
(5) may be useful in predicting student
achievement in introductory psychology.
Klugh (6) found three scales on the GZTS 
to correlate positively and significantly with 
total scores on the ACE. For a sample of 
225 male students the Objectivity and Friend- 
liness scales were significant at the .01 level 
(r = .19 and .18), while the Personal Rela- 
tions scale was significant at the .05 level 
(r = .14). Of the remaining seven scales,
only the Masculinity scale approached significance
(r = .11). Since the ACE is significantly
related (R = .47) to achievement
level in introductory psychology (10) these 
GZTS scales could be expected to show some 
correlation with the same achievement vari- 
able. However, Krumm (8), using the ACE 
as a predictor, obtained discrepancy scores 
between the predicted and obtained grades 
in introductory psychology (N = 410) and 
identified the top and bottom quarters of the 
resulting distribution. Comparing the mean 
GZTS scores of these groups of “over- 
achievers” and “underachievers” showed none 
of the GZTS scales to significantly discriminate
between these extreme groups.

The problem of the present research was to 
compare the relation of GZTS to achievement 
level in introductory psychology when both 
rectilinear and curvilinear correlation tech- 
niques are used, and to make a similar com- 
parison when achievement fluctuation is used 
as the criterion. 


Procedure 


Subjects. The Ss were 155 students enrolled 
in daytime sections of introductory psychology 
at the University of Pittsburgh during the Spring, 
1953, semester and who were present in class the 
day the personality questionnaire was adminis- 
tered. There were 107 men and 48 women in 
the sample with the great majority being fresh- 
men and sophomores. 

Variables. The Guilford Zimmerman Tempera- 
ment Survey (5) was administered to all sec- 
tions near the beginning of the semester by 
trained examiners. Raw scores on each of the 
ten GZTS scales were used in the later analysis. 

The achievement variables were derived from 
students’ scores on five course achievement ex- 
aminations given during the semester. All five 


1 Appreciation is expressed to Dr. Frederick Herz- 
berg of Psychological Services of Pittsburgh who 
supervised the administration and scoring of the 
temperament survey. 
tests were objective, 50-item, multiple-choice
examinations and the raw scores from each test
were converted to standard scores based upon 
the performance of all students in introductory 
psychology on each single test. The large ma- 
jority of students (N = 126) took all five tests 
during the semester, but graduating seniors (N 
= 22) were excused from the last two tests. The 
remaining students (N= 7) had missed one of 
the tests and had received an incomplete grade 
in the course, but were retained in the present 
sample. Details of the course evaluation pro- 
cedure have been previously described (1). 

The variable of achievement level used in this 
study was the letter grade received by each stu- 
dent. These grades were determined by finding 
the mean of each S’s test standard scores and 
converting this average to a letter grade on the 
basis of previously established cutting points 
which were common to all sections. This achieve- 
ment level variable has been shown to have an 
estimated reliability of .80 (1, p. 316). 

The achievement fluctuation variable was de- 
rived from the range between the highest and low- 
est test scores received by each S over the semes- 
ter. Dixon and Massey (3, pp. 240-241) have 
shown that the range in small samples is a highly 
efficient estimate of the population variability and 
is computationally much simpler than computing 
the standard deviation of each S’s scores. The 
reliability of this fluctuation measure was esti- 
mated by drawing a random sample of 100 Ss 
who had taken all five course tests. The range 
was computed for each S as the difference be- 
tween his highest and lowest scores and a second 
measure of the fluctuation was obtained by find- 
ing the range between the S’s second highest and 
second lowest test scores. Correlating these 
pairs of fluctuation measures for the 100 Ss gave 
a coefficient of .59, which indicates that the 
range is a relatively stable measure of achieve- 
ment fluctuation. Finally, the range for each of 
the 155 Ss in the study sample was multiplied by 
the constants given by Dixon and Massey (3, p. 
240) to give an estimate of the standard devia- 
tion of each S’s achievement test scores. 
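
A minimal sketch of the range-based estimate follows. The divisors are the standard expected values of the range of a normal sample, in standard deviation units, for samples of two to five observations; it is an assumption that they agree with the constants in the Dixon and Massey table cited above. The function names and the example scores are hypothetical.

```python
# Expected range of a normal sample in sigma units, by sample size.
# Assumed to correspond to the tabled constants (range / D2 = estimated SD).
D2 = {2: 1.128, 3: 1.693, 4: 2.059, 5: 2.326}

def sd_from_range(scores):
    """Estimate one student's standard deviation from the range of his scores."""
    return (max(scores) - min(scores)) / D2[len(scores)]

def inner_range(scores):
    """Second-highest minus second-lowest score: the alternate fluctuation
    measure that was correlated with the full range to estimate reliability."""
    ordered = sorted(scores)
    return ordered[-2] - ordered[1]

student = [52, 47, 60, 55, 49]      # hypothetical standard scores on five tests
estimate = sd_from_range(student)   # range of 13 divided by 2.326
alternate = inner_range(student)    # 55 - 49 = 6
```

Using a divisor that depends on the number of tests taken puts students who took three, four, or five tests on a common scale, which a raw range would not do.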


Results 


The distribution of achievement fluctua- 
tion measures appeared positively skewed and 
was tested for normality by grouping indi- 
vidual measures into six categories and ap- 
plying the usual chi-square test. Chi-square 
equalled 9.94 which with 3 degrees of free- 
dom was significant at the .05 level. To re- 
duce skewness these measures were put 
through a square-root transformation and the 
transformed distribution tested for normality 
by the usual chi-square method. The
chi-square value with three degrees of freedom
was 5.60 which is not significant at the .05 
level of confidence. The mean and variance 
of the fluctuation measures were computed 
for each of the five achievement level groups 
and an analysis of variance performed to test 
the hypothesized curvilinear relationship be- 
tween level and fluctuation. The F compari- 
son between the means gave an F value of 
4.03 which is significant at the .01 level. 
This F corresponds to a curvilinear correla- 
tion (eta) of .31, while the product-moment 
rectilinear correlation between level and fluctuation
gave a nonsignificant r of -.15. A
chi-square test of the significance of the dif- 
ference between their curvilinear and recti- 
linear coefficients gave a chi-square of 12.40, 
which, with 3 degrees of freedom, is significant 
at the .01 level. A chi-square test of the ho- 
mogeneity of the variances of the fluctuation 
measures within the five achievement level 
groups yielded a value of 5.61 which, with 
four degrees of freedom, is not significant at 
the .05 level. The mean fluctuation for each 
of the achievement level groups can be found 
in Table 1. As hypothesized, the middle 
achievement level groups show the largest 
average fluctuation and the extreme level 
groups (A and F) demonstrated significantly 
less average fluctuation. 
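
The reported figures can be checked against one another with a short calculation. The sketch below converts the F ratio to eta and then forms a chi-square for departure from linearity. The chi-square form used here, (N - k)(eta^2 - r^2) / (1 - eta^2) with k - 2 degrees of freedom, is an assumption, since the text does not state the formula; it is offered because it comes close to the published value.

```python
import math

def eta_from_f(f, k, n):
    """Correlation ratio implied by a one-way F with k groups and n cases."""
    between = f * (k - 1)
    return math.sqrt(between / (between + (n - k)))

def linearity_chi_square(r, eta, n, k):
    """Assumed chi-square for eta versus r; referred to k - 2 degrees of freedom."""
    return (n - k) * (eta ** 2 - r ** 2) / (1 - eta ** 2)

eta = eta_from_f(4.03, k=5, n=155)                     # about .31
chi2 = linearity_chi_square(-0.15, eta, n=155, k=5)    # about 12.4, 3 df
```

Under these assumptions the F of 4.03 does correspond to an eta of about .31, and the chi-square lands near the reported 12.40 with 3 degrees of freedom; the small difference is of the size that rounding in r and eta would produce.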

Since our two measures of achievement, 
level and fluctuation, are nonlinearly related, 


Table 1

Distribution of Achievement Fluctuation Measures
for the Achievement Level Groups

Achievement    Number of    Fluctuation    Standard
Level          Subjects     Mean           Deviation
A              21           3.12           .91
B              32           3.68           .91
C              62           3.83           .88
D              26           4.05           .60
F              14           3.49           .92

Significance of Means (F)                  4.03**
Homogeneity of Variances (chi-square)      5.61

** Significant at the .01 level.
it was necessary to perform a further transformation
on the fluctuation measures to
clarify the relation of GZTS scales to these 
two criteria. To insure a zero correlation be- 
tween level and fluctuation each S’s fluctua- 
tion measure was taken as a deviation (plus 
or minus) from the mean fluctuation of his 
achievement level group. Since the variances 
of the fluctuation measures within the level 
groups appeared to be homogeneous, the fur- 
ther step of dividing each deviation by the 
standard deviation of its level group was not 
necessary. These deviations from all five 
level groups were then pooled and divided 
into five fluctuation groups. Group I in- 
cluded the 20 Ss showing the largest intra- 
subject achievement fluctuation (independent 
of achievement level), Group V comprised 17 
Ss with the smallest intra-subject fluctuation, 
with Groups II, III, and IV consisting of Ss 
showing intermediate amounts of fluctuation. 
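
The adjustment just described amounts to centering each fluctuation measure on the mean of its own achievement level group. A minimal sketch, with hypothetical function names and data, is given below.

```python
from collections import defaultdict

def deviations_from_group_mean(levels, fluctuations):
    """Express each fluctuation measure as a deviation from the mean
    fluctuation of the student's own achievement level group."""
    by_level = defaultdict(list)
    for level, value in zip(levels, fluctuations):
        by_level[level].append(value)
    means = {level: sum(v) / len(v) for level, v in by_level.items()}
    return [value - means[level] for level, value in zip(levels, fluctuations)]

# Hypothetical letter grades and fluctuation measures for seven students.
levels = ["A", "A", "C", "C", "C", "F", "F"]
fluct = [3.0, 3.4, 3.5, 4.0, 4.5, 3.2, 3.6]
adjusted = deviations_from_group_mean(levels, fluct)
```

After this step every level group has a mean adjusted fluctuation of zero, so neither a rectilinear nor a curvilinear coefficient can show any relation between level and the adjusted measure.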

Raw scores on the ten GZTS scales were 
then correlated with the criterion measures 
of achievement level and achievement fluctua- 
tion. Rectilinear product-moment correlations 
were computed by weighting achievement 
level groups A through F with unit digits 4
through 0 and similarly weighting achievement
fluctuation groups I through V with the
same weights. These weights were then cor- 
related with the raw GZTS scores. In addi- 
tion, curvilinear correlations (eta) between 
the GZTS scales and the two criterion meas- 
ures were computed and chi-square tests of 
the significance of the difference between the 
rectilinear and curvilinear coefficients evalu- 
ated. These correlations and tests of curvi- 
linearity are given in Table 2. It can be 
noted that the GZTS Objectivity scale is 
rectilinearly related to achievement level and
this is probably also true of the Restraint 
scale. Friendliness and Masculinity are re- 
lated to level in a curvilinear fashion, but the 
product-moment coefficients for these two 
scales are not significant. None of the GZTS 
scales are related to achievement fluctuation 
when only the product-moment coefficients 
are considered, but Ascendance, Social Inter- 
est, and Emotional Stability are curvilinearly 
related to fluctuation. 
Table 2

Rectilinear and Curvilinear Correlations between Guilford-Zimmerman Scales and
Achievement Level and Fluctuation

r = product-moment correlation; Eta = curvilinear correlation;
Chi-Square = significance of difference between r and Eta.

                       Achievement Level                    Achievement Fluctuation
GZTS Scale             r             Eta      Chi-Square    r             Eta           Chi-Square
General Activity       -.13          .18      2.55          -.02          .18           4.83
Restraint              .20*          .24*     2.56          -.03          .08           0.93
Ascendance             -.13          .19      2.80          .00           [.35]         18.08**
Social Interest        -.14          .20      2.98          [illegible]   [.27]         9.20*
Emotional Stability    .15           .17      2.13          .05           [.24]         7.98*
Objectivity            [.21]         .28*     5.34          .03           .07           0.58
Friendliness           [illegible]   [.27]    9.16*         [illegible]   [illegible]   1.29
Thoughtfulness         -.02          .04      0.18          .09           .14           1.84
Personal Relations     .08           .19      4.68          .12           .17           2.46
Masculinity            .15           .25*     5.79          -.09          .12           0.90

* Significant at the .05 level.
** Significant at the .01 level.


Discussion 


Our results confirm the hypothesis of a sig- 
nificant curvilinear relation between achieve- 
ment level and achievement fluctuation. This 
indicates that our level criterion is an impure 
measure of achievement differences between 
students and suggests that similar criteria 
used widely in educational research are simi- 
larly contaminated by the fluctuation vari- 
able. Nor is intra-student fluctuation a 
chance phenomenon: the moderate but sig- 
nificant correlation between two similar meas- 
ures of fluctuation that was found in this 
study shows its reliability. 

The correlations in Table 2 bear on points 
(a) and (c) made in the third paragraph of 
this paper. Two GZTS scales, Restraint and 
Objectivity, are rectilinearly related to our 
contaminated level criterion, but two addi- 
tional scales, Friendliness and Masculinity, 
show insignificant rectilinear, but significant 
curvilinear correlations with level. This con- 
firms point (a), since neither of these last 
two scales would have appeared to be re- 
lated to level if curvilinear correlation tech- 
niques had not been used. However, point 
(c), as expressed in the third paragraph, is
not confirmed, since none of the GZTS scales 
are rectilinearly related to fluctuation. The 


significant curvilinear correlation of three 
GZTS scales, Ascendance, Social Interest, and 
Emotional Stability, with fluctuation suggests 
that point (c) is too naively stated. Omit- 
ting the word “rectilinearly” in point (c) 
yields a hypothesis that appears plausible 
in view of our findings. Perhaps these three 
GZTS scales show essentially zero relation- 
ships with the impure criterion of level be- 
cause of their curvilinear correlation with the 
pure measure of fluctuation. These results 
do not confirm point (c), but suggest that a 
modified form of the hypothesis is tenable. 
The available data did not permit a direct 
test of point (b). However, there are sug- 
gestive consistencies and discrepancies be- 
tween our results and previous studies (6, 8, 
10) that indirectly bear on this point. Klugh 
(6) found the Objectivity, Friendliness, and 
Personal Relations scales to be significantly 
related to the total score on the ACE, and 
the Masculinity scale to fall just short of sta- 
tistical significance. Russell and Bendig (10) 
demonstrated a significant relation between 
the ACE and our level criterion, while Krumm 
(8) showed the GZTS scales were not predictive
of achievement level in introductory psychology
when the variability in level attributable
to ACE differences was statistically eliminated.
We found the Restraint, Objectivity,
Friendliness, and Masculinity scales significantly
related to level when academic aptitude
is uncontrolled. These results suggest
that the Friendliness and Objectivity scales
on the GZTS measure the same aspects of
achievement performance that are measured by
the ACE and could not profitably be used in
a regression equation along with the ACE to
predict achievement level in introductory psychology.
However, the Restraint and Personal
Relations scales probably would increase
the predictability of level if used in
conjunction with the ACE: the Restraint
scale because of its significant correlation
with level and its low correlation with ACE,
while the Personal Relations scale could act
as a suppressor variable due to its lack of
correlation with level and its significant relation
to the ACE. Admittedly the predictive
usefulness of these two scales is unproven,
but it provides a hypothesis to be
tested in a later sample.
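
How a suppressor variable can add to prediction is easiest to see from the two-predictor formula for multiple R. The sketch below is illustrative only: the ACE-level correlation of .47 and the Personal Relations-ACE correlation of .14 are taken from the studies cited above, which used different samples, and the correlation of Personal Relations with level is idealized as exactly zero. The function name is hypothetical.

```python
import math

def multiple_r(r1y, r2y, r12):
    """Multiple R for two predictors, from the two validities (r1y, r2y)
    and the correlation between the predictors (r12)."""
    r_squared = (r1y ** 2 + r2y ** 2 - 2 * r1y * r2y * r12) / (1 - r12 ** 2)
    return math.sqrt(r_squared)

ace_alone = 0.47
# Second predictor uncorrelated with the criterion but correlated .14
# with the first predictor: the classical suppressor pattern.
with_suppressor = multiple_r(0.47, 0.0, 0.14)
```

With a zero validity the formula reduces to r1y / sqrt(1 - r12^2), so the size of any gain depends entirely on how strongly the suppressor correlates with the other predictor.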


Summary 


Scores on the Guilford-Zimmerman Tem- 
perament Survey were correlated by both 
rectilinear and curvilinear methods with meas- 
ures of course achievement level and intra- 
student achievement fluctuation in introduc- 
tory psychology (N= 155). Achievement 
level and fluctuation were curvilinearly re- 
lated and the fluctuation measures were ad- 
justed to remove this artifact. Two GZTS 
scales, Restraint and Objectivity, were recti- 
linearly related to level (r = .20 and .21), 
while two additional scales, Friendliness and 
Masculinity, showed significant curvilinear 
correlations with level (eta = .27 and .25). 
None of the GZTS scales were rectilinearly
related to fluctuation, but three scales, As- 
cendance, Social Interest, and Emotional Sta- 
bility, were curvilinearly correlated with fluc- 
tuation (eta = .35, .27, and .24). 
This article describes an attempt to develop
scales for the Minnesota Multiphasic
Personality Inventory (MMPI) which meas- 
ure a person’s ability to get along well with 
others. Such scales should prove valuable in 
selecting personnel who must deal with the 
public or work harmoniously and effectively 
with a group. The validity of the scales re- 
ported here is based on their power to predict 
the rapport of teachers with pupils in a class- 
room. Since the content of the items is not 
directly related to school work, it is believed 
that the scales may prove useful also in the 
selection of sales people, officers and non- 
commissioned officers in the armed forces, 
foremen, and other personnel who must be 
able to establish rapport with others and 
maintain group morale. The scales are pub- 
lished here with a view to encouraging fur- 
ther experimentation in other situations. 

For a number of years a series of research 
studies, centering at the University of Min- 
nesota, has been carried on with the isolation 
and measurement of non-intellectual factors 
related to success in teaching as its principal 
object. A major outcome of this work has 
been the development of the Minnesota 
Teacher Attitude Inventory (MTAI) (2), 
which has been found to predict teacher- 
pupil rapport with a degree of validity indi- 
cated by correlations with independent cri- 
teria of from .50 to .63. When the MTAI 
was standardized on a large sample of Minne- 
sota teachers (1), it was possible to identify, 
in the extremes of the distribution, two groups 
of teachers sharply differing in their ability 
to get along with pupils. The MMPI (3) 
was administered to these two groups, and 


* This study was made possible by a grant from 
the research funds of the Graduate School of the 
University of Minnesota. 
212 completed inventories, 112 representing
approximately the 8 per cent of teachers scoring
highest, and 100 the 8 per cent scoring
lowest (among all of the public school teachers
in Minnesota) on the MTAI, were obtained.

The MMPI contains 550 items of the True- 
False type with a wide variety of content. 
When the proportions making each response 
in each of the two groups of teachers were 
compared, after being transformed to angles 
by the arc-sine transformation (5), the dif- 
ference between the groups was found to be 
significant at the 5 per cent level on 250 
items. 
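
The item comparison can be sketched as follows. The code assumes the usual form of the test, in which each proportion is transformed to 2 arcsin(sqrt(p)) radians, a quantity whose sampling variance is approximately 1/n; the text does not give the formula, so this form, the function name, and the example proportions are assumptions.

```python
import math

def arcsine_z(p1, n1, p2, n2):
    """Normal deviate for the difference between two independent
    proportions after the arc-sine (angular) transformation."""
    phi1 = 2 * math.asin(math.sqrt(p1))
    phi2 = 2 * math.asin(math.sqrt(p2))
    return (phi1 - phi2) / math.sqrt(1 / n1 + 1 / n2)

# Hypothetical item: 60 per cent of the 112 high-scoring teachers and
# 45 per cent of the 100 low-scoring teachers answer "True".
z = arcsine_z(0.60, 112, 0.45, 100)
significant_at_5_per_cent = abs(z) > 1.96
```

With 550 items each tested at the 5 per cent level, between 27 and 28 would be expected to reach significance by chance alone, which gives a baseline against which the 250 discriminating items can be judged.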

The teacher who scores low on the MTAI
describes himself in his responses (2) as gen- 
erally hostile toward others; he says that 
pupils are dishonest, insincere, untrustworthy, 
lazy, etc. His self-description also indicates 
that he: (a) adheres excessively to rigid 
standards of morality; (b) tends to dominate 
those below him and be subservient to those 
above him; and (c) prides himself on a thor- 
ough knowledge of his subject-matter. Among 
the 250 discriminating MMPI items were 
many which reflected generalized hostility to- 
ward people, and others that suggested Phari- 
saic virtue. There were no items which re- 
flected the tendency toward security through 
power over people or through mastery of sub- 
ject matter. The other discriminating items 
suggested symptoms of depression, anxiety 
and general neurosis. 

A total of 77 items which most obviously 
reflected hostility were chosen for a pre- 
liminary “Ho” scale, and 60 items having to 
do with virtue and morality were chosen for 
a “Pv” scale. When the MMPI answer 
sheets completed by 200 graduate students in 
education (all of whom were experienced 
classroom teachers) were scored on these two 
keys, correlations of -.45 (for “Ho”) and
-.49 (for “Pv”) with the MTAI were obtained.
On the strength of these results, further
refinement of the scales was undertaken.
Five clinical psychologists, working independ- 
ently, selected sets of “Ho” and “Pv” items. 
On the basis of agreement among the five, 
a final 50-item Ho key was selected. The 
reliability coefficient of this scale for the 200 
graduate students, estimated by analysis of 
variance (4), was .86. 
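
The text says only that reliability was "estimated by analysis of variance." One common procedure of that kind, assumed here for illustration, treats the answer sheets as a persons-by-items table and takes reliability as 1 minus the ratio of the residual mean square to the mean square between persons. The function name and the small data set are hypothetical.

```python
def anova_reliability(item_scores):
    """Reliability from a persons-by-items analysis of variance.
    item_scores: one row per person, each row a list of 0/1 item scores."""
    n, k = len(item_scores), len(item_scores[0])
    grand = sum(map(sum, item_scores)) / (n * k)
    person_means = [sum(row) / k for row in item_scores]
    item_means = [sum(row[j] for row in item_scores) / n for j in range(k)]
    ss_total = sum((x - grand) ** 2 for row in item_scores for x in row)
    ss_persons = k * sum((m - grand) ** 2 for m in person_means)
    ss_items = n * sum((m - grand) ** 2 for m in item_means)
    ss_residual = ss_total - ss_persons - ss_items
    ms_persons = ss_persons / (n - 1)
    ms_residual = ss_residual / ((n - 1) * (k - 1))
    return 1 - ms_residual / ms_persons

# Hypothetical responses of five persons to four keyed items.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
reliability = anova_reliability(responses)   # 0.8 for these data
```

This estimate is algebraically the same as coefficient alpha computed from item and total-score variances, so either route can be used to check the other.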

Substantial agreement among the psycholo- 
gists could be obtained on only 20 of the Pv 
items. On the basis of an internal consist- 
ency item analysis carried out on the 200 
graduate students, 30 items were added to 
produce a 50-item Pv key. The internal con- 


Table 1

Correlation Coefficients

                                Males        Females        Total
                                N = 100      N = 100        N = 200
Ho vs. MTAI                     -.44         [illegible]    -.44
Pv vs. MTAI                     -.38         [illegible]    -.46
Ho vs. Pv                       .65          [illegible]    .69
Ta vs. MTAI                     -.45         [illegible]    -.50
Multiple R, Pv + Ho vs. MTAI    -.46         -.55
Regression coefficients
  Beta weight for Ho            -.335        -.109
  Beta weight for Pv            -.163        -.463

sistency of this scale could not be estimated 
on the 200 papers used in the item analysis, 
so the papers of 55 other graduate students 
in education were used for this purpose. The 
reliability, estimated by analysis of variance, 
was .88. 

Direct evidence regarding the validity of 
the Ho and Pv scales for predicting pupil- 
teacher rapport as measured by the MTAI, 
and indirect evidence as to their validity for 
measuring “Hostility” and “Pharisaic virtue,” 
was obtained by correlating the scores of the 
200 graduate students on the two scales with 
their scores on the MTAI. The results are 
summarized in Table 1. 

The sample contained 100 males and 100 
females; correlations and beta weights are 
presented for the two sexes separately in the 


Table 2 
Items Included in the Ho (Hostility) Key for the 
Minnesota Multiphasic Personality Inventory * 
(Listed according to number on the Group Form) 








19 136 265 386 455 
28 148 271 394 458 
52 157 278 399* 469 
59 183 280 406 485 
71 226 284 410 504 


89 [illegible] 292 411 507
93 244 319 426 520 
110 250 348 436 531 
117 252 368 438 551 
124 253* 383 447 558 


* Items marked with an asterisk are keyed “False”; 
all other items are keyed “True.” 


first two columns, and correlations for the en- 
tire sample in the third column. 

The Ho scale tends to be more effective for 
males than the Pv scale, while the reverse 
holds for females, although none of the sex 
differences is statistically significant. In the 
multiple regression equation for predicting 
MTAI scores from Ho and Pv scores for 
males, the addition of the Pv scale does not 
significantly improve on the prediction from 
the Ho scale alone. In the multiple regres- 
sion equation for predicting MTAI scores 
from Ho and Pv scores for females, the addi- 
tion of the Ho scale does not significantly im- 
prove on the prediction from the Pv key 
alone. 
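
The beta weights and multiple R for two predictors follow directly from the three intercorrelations. The sketch below applies the standard formulas to the male correlations as far as they can be read in Table 1 (Ho vs. MTAI -.44, Pv vs. MTAI -.38, Ho vs. Pv .65); the function name is hypothetical.

```python
import math

def beta_weights(r1y, r2y, r12):
    """Standardized regression weights and multiple R for two predictors."""
    denom = 1 - r12 ** 2
    b1 = (r1y - r2y * r12) / denom
    b2 = (r2y - r1y * r12) / denom
    big_r = math.sqrt(b1 * r1y + b2 * r2y)
    return b1, b2, big_r

# Males: predictor 1 is Ho, predictor 2 is Pv, criterion is the MTAI.
b_ho, b_pv, big_r = beta_weights(-0.44, -0.38, 0.65)
```

These inputs give weights of about -.334 for Ho and -.163 for Pv and a multiple R of about .46 in absolute value, in close agreement with the tabled male values; the Pv weight is small because most of what Pv predicts is already carried by Ho.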


Table 3

Items Included in the Pv (Pharisaic Virtue) Key for the
Minnesota Multiphasic Personality Inventory *

(Listed according to number on the Group Form)

[illegible]   147    [illegible]   401*   468
[illegible]   158    [illegible]   402    470
[illegible]   176*   [illegible]   404    492
[illegible]   206    375           413    499
[illegible]   232    [illegible]   414    502

[illegible]   289    [illegible]   416    506
111           317    [illegible]   439    509
112           336    [illegible]   443    510
119           337    [illegible]   457    548
129           338    [illegible]   461    564

* Items marked with an asterisk are keyed “False”;
all other items are keyed “True.”
Table 4
Norms for Hostility (Ho) Scale of MMPI

 Raw     T Score        Raw     T Score        Raw     T Score
Score    M     F       Score    M     F       Score    M     F
 50      93    95       33      69    71       16      [illegible]  47
 49      91    94       32      68    70       15      44    46
 48      90    92       31      66    68       14      43    45
 47      89    91       30      65    67       13      41    43
 46      87    89       29      64    66       12      40    42
 45      86    88       28      62    64       11      39    40
 44      84    87       27      61    63       10      37    39
 43      83    85       26      59    61        9      36    38
 42      82    84       25      58    60        8      35    36
 41      80    82       24      57    59        7      33    35
 40      79    [81]     23      55    57        6      32    33
 39      77    80       22      54    56        5      30    32
 38      76    78       21      53    54        4      29    31
 37      75    77       20      51    53        3      28    29
 36      73    75       19      50    52        2      26    28
 35      72    74       18      48    50        1      25    26
 34      71    73       17      47    49        0      23    25

The correlations obtained with multiple regression
weights on both scales combined are
practically identical with those obtained when
the two scales are thrown together into one
100-item “Ta” (teacher attitude) scale. If
the best predictor of teacher-pupil rapport for
both sexes is desired, the Ta scale would
probably be the most satisfactory.

The magnitude of the intercorrelation between
the two scales is enough smaller than
Table 5
Norms for Pharisaic-Virtue (Pv) Scale of MMPI

 Raw     T Score        Raw     T Score        Raw     T Score
Score    M     F       Score    M     F       Score    M     F
 50      99    91       33      73    66       16      46    41
 49      98    90       32      71    64       15      45    39
 48      96    88       31      70    63       14      43    38
 47      94    87       30      68    62       13      41    36
 46      93    85       29      66    60       12      40    35
 45      91    84       28      65    59       11      38    34
 44      90    82       27      63    57       10      37    32
 43      88    81       26      62    56        9      35    31
 42      87    79       25      60    54        8      34    29
 41      85    78       24      59    53        7      32    28
 40      84    76       23      57    51        6      31    26
 39      82    75       22      55    50        5      29    25
 38      80    73       21      54    48        4      27    23
 37      79    72       20      52    47        3      26    22
 36      77    70       19      51    45        2      24    20
 35      76    69       18      49    44        1      23    19
 34      74    67       17      48    42        0      21    17
their reliabilities to suggest that they are
measuring different, although highly related, 
dimensions of personality. 

Lists of the items included in the two scales 
are presented as Tables 2 and 3, all items be- 
ing keyed “true” except those marked with 
an asterisk. The numbers given are those on 
the group form of the MMPI (3). A key 
for either of the scales may be easily pre- 
pared by making a scoring stencil perforated 
as indicated in these tables. 

If it is remembered that these items represent
the individual’s own description of himself,
some insight into the personality of the
individual who scores high on one of these 
scales may be obtained by reading the items. 

Typical items on the Ho scale are the fol- 
lowing: “I would certainly enjoy beating a 
crook at his own game,” “When someone does
me a wrong I feel I should pay him back if I 
can, just for the principle of the thing,” “I 
have often met people who were supposed to 
be expert who were no better than I.” Thus 
revealed, the hostile person is one who has 
little confidence in his fellowman. He sees 
people as dishonest, unsocial, immoral, ugly, 


Table 6
Norms for the Teacher Attitude (Ta) Scale (Ho plus Pv Scales) of MMPI

Raw Score    T Score (M)    T Score (F)
[table body illegible]

and mean, and believes they should be made
to suffer for their sins. Hostility amounts to 
chronic hate and anger. 

Among the 20 “core” items on the Pv scale 
are items like the following: “I believe that a 
person should never taste an alcoholic drink,” 
“Sexual things disgust me,” and “I deserve 
severe punishment for my sins,” suggesting 
preoccupation with ideas of sin and punish- 
ment; among the 30 items added by item 
analysis such items as “I am inclined to take 
things hard,” “It makes me nervous to have 
to wait,” and “Dirt frightens or disgusts me,” 
suggest general neurosis. 

Norms for the Ho, Pv, and Ta scales were 
derived on a sample of the same normal 
group that was used in deriving the norms for 
the original clinical scales of the MMPI. 
The sample consisted of 541 individuals, 226 
males and 315 females. These norms for 
males and females are presented in Tables 4, 
5, and 6. 
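
Norm tables of this kind are ordinarily linear conversions of raw scores to a scale with a mean of 50 and a standard deviation of 10. The sketch below assumes that form; the mean and standard deviation shown are hypothetical placeholders, not the values of the normative sample, which the text does not report.

```python
def t_score(raw, mean, sd):
    """Linear T score: mean 50, standard deviation 10, rounded to an integer."""
    return round(50 + 10 * (raw - mean) / sd)

# Hypothetical normative mean and standard deviation for one sex on one scale.
NORM_MEAN = 19.0
NORM_SD = 7.0

at_mean = t_score(19, NORM_MEAN, NORM_SD)        # 50
one_sd_above = t_score(26, NORM_MEAN, NORM_SD)   # 60
```

Separate means and standard deviations for males and females yield the separate M and F columns of a norm table.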


Summary 


The development of two keys for the Min- 
nesota Multiphasic Personality Inventory by 
selecting principally on the basis of content 
two sets of 50 items from 250 found to dis- 
criminate significantly between teachers scor- 
ing high and teachers scoring low on the Min- 
nesota Teacher Attitude Inventory is de- 
scribed and the items are listed. The Ho 
scale (Hostility) reveals a type of individual 
characterized by a dislike for and distrust of
others. The Pv scale (Pharisaic virtue) reveals
a type of person who describes himself
as preoccupied with morality and ridden with
fears and tensions. A Ta (Teacher atti- 
tude) scale made up of all 100 items is also 
proposed. When administered to a rather 
homogeneous group of graduate students in 
education classes, the internal consistency 
reliability coefficients of the two short scales 
were estimated to be .86 (for Ho) and .88 
(for Pv), and the Ho, Pv, and Ta scales correlated
-.44, -.46, and -.50, respectively,
with the Minnesota Teacher Attitude In- 
ventory. 


Received November 3, 1953. 
Two relatively recent developments in
counseling theory relate to the importance of 
understanding the individual’s “level of vo- 
cational aspiration” and “job values and de- 
sires.” These two personality dimensions are 
assuming increasing importance in our at- 
tempts to explore the dynamics of vocational 
selection and adjustment. 

Vocational counselors have long been aware 
of the importance of level of vocational as- 
piration as a guidepost in making long range 
plans because realism in level of aspiration 
does much to overcome pressures for the se- 
lection of vocational goals which might lead 
to much frustration. Because of the importance
of vocational aspiration level to mental
health, research in this area is greatly needed, 
particularly as an aid in disclosing the rela- 
tionship of aspiration level to other aspects 
of vocational selection. 

Another area where research is needed is 
that of job values and desires which are also 
of great importance in making vocational 
plans. By job values and desires are meant 
the answers given to the basic question, 
“What do you really want from a job?” Job 
values and desires refer not to the kind of 
work or duties performed, but to the source 
of satisfaction in the work and are defined in 
this study as the choices listed in the follow- 
ing Job Values and Desires Checklist. 


Centers’ Job Values and Desires Checklist 


If you had a choice of one of these kinds of 
jobs, which would you choose? (Put a number 
“1” by your FIRST choice. If you have OTHER 
choices which you would like to indicate, put a 
number “2” by your second choice and a number 
“3” by your third.) 


__ A. A job where you could be a leader.

__ B. A very interesting job.

__ C. A job where you would be looked upon
      very highly by your fellow men.

__ D. A job where you could be boss.

__ E. A job which you were absolutely sure
      of keeping.

__ F. A job where you could express your
      feelings, ideas, talent, or skill.

__ G. A very highly paid job.

__ H. A job where you could make a name
      for yourself, or become famous.

__ I. A job where you could help other people.

__ J. A job where you could work more or
      less on your own.


Centers¹ has done extensive work on this
problem with adults from different social 
classes as well as from rural and urban en- 
vironments. His major finding was that self- 
expression is a “middle class” job value while 
security is a “working class” value. 

The present study attempts to examine the 
job values and desires of adolescents in rela- 
tion to level of aspiration as measured by the 
Level of Interest section of the California 
Occupational Interest Inventory. The prob- 
lem being explored here is whether differences 
in level of aspiration are reflected in differ- 
ences in job values and desires. It is hoped 
that some understanding of the relationship 
between the concepts of level of interest and 
of job values may follow from such an ex- 
ploration. 

Some justification is needed for considering
the Level of Interest section as a measure of
vocational aspiration. The manual² for the
interest inventory gives no evidence indicating
that scores on the Level section are in any way
associated with differences in aspiration level.
However, Stefflre³ found that when scores on the
Level of Interest section were compared to an
independent measure of vocational aspiration
(the client’s vocational objective), subjects
aspiring to the higher level occupations made
significantly higher scores. Stefflre concluded,
in speaking of the Level section, “This section
of the test would appear to be a good rough index
of the direction and extent of the student’s
aspiration as it will be expressed through the
selection of a vocational objective.” This
research on over 1,000 high school seniors
suggests that the Level of Interest section on
the Lee-Thorpe Occupational Interest Inventory
is an adequate measure of vocational aspiration.


1 R. Centers. Psychology of social class.
Princeton, New Jersey: Princeton University
Press, 1949, 219 pp.

2 E. A. Lee and L. A. Thorpe. Manual of
directions: Occupational Interest Inventory,
Advanced Series. Hollywood: California Test
Bureau, 1943.

The present study compared the job values 
and desires of seventeen- and eighteen-year- 
old Caucasians who scored in the lower quar- 
ter on the Level of Interest section of the 
California Occupational Interest Inventory to 
similar groups scoring in the upper quarter 
on the same section. The null hypothesis is 
that differences in level of aspiration are un- 
related to the preference for job values and 
desires. The sample was composed of 212 
male high school seniors and 242 female high 
school seniors from the Los Angeles City 
Schools. Separate analyses were made for 
males, for females, and for a combined sample 
of both sexes. Chi square with the Yates 
correction was applied to examine the rela- 
tionships. 
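
As an illustration only, the chi square with the Yates correction for a single 2 x 2 comparison (upper versus lower quarter, value chosen versus not chosen) might be sketched in Python as follows. The function name and the counts are assumptions: the counts are rebuilt from the rounded percentages reported for "independence" among the males, so the result is approximate and is not the authors' own computation.

```python
def yates_chi_square(a, b, c, d):
    """Chi square with Yates' continuity correction for a 2 x 2 table.

    a, b = first group: chose the value / did not choose it
    c, d = second group: chose the value / did not choose it
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be positive")
    # Shrink |ad - bc| by n/2 (the Yates correction), never below zero.
    corrected = max(abs(a * d - b * c) - n / 2, 0.0)
    return n * corrected ** 2 / (row1 * row2 * col1 * col2)


# Hypothetical counts (an assumption): about 8 per cent of 148
# upper-quarter males and 19 per cent of 64 lower-quarter males
# chose "independence".
upper_yes, upper_n = 12, 148
lower_yes, lower_n = 12, 64
chi2 = yates_chi_square(upper_yes, upper_n - upper_yes,
                        lower_yes, lower_n - lower_yes)
print(round(chi2, 2))
```

With one degree of freedom the result is compared with 3.84 for the 5 per cent level and 6.63 for the 1 per cent level.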

All subjects had participated in a special- 
ized vocational guidance program made avail- 
able to them during the 1952-53 school year. 
The guidance program consisted of seven 
steps: (1) initial structuring meeting during 
which the entire counseling program was ex- 
plained; (2) basic testing which measured 
mental capacity, interest, and temperament; 
(3) initial interview with a counselor to re- 
late test results and personal-social back- 
ground to tentative vocational objectives; 
(4) study of occupational information; (5) 
additional testing as needed; (6) final
interview to plan objectives and training; and
(7) invitation to the parents to discuss the
student’s plans with the counselor.


3 B. Stefflre. Psychological factors associated with
aspiration for socio-economic mobility. Unpublished
dissertation, University of Southern California, June
1953.

During the basic testing period, the meas- 
ure of interest used was the California Occu- 
pational Interest Inventory. This test has 
six fields of interest (Personal-Social,
Natural, Mechanical, Business, Arts, and Science),
three types of interest (Verbal, Manipulative,
and Computational), and a Level of Interest
section which has been discussed above
as a measure of vocational aspiration. The 
present study was only concerned with this 
last section of the test. 

Centers’ Job Values and Desires Checklist 
was used as the index of the student’s job 
value preferences. The card was presented 
and checked during the first interview. 

Consistently the percentage of respondents 
selecting category B (“Interesting experience”)
and category F (“Self-expression”)
was far above the percentage selecting any of 
the other categories. This finding was appar- 
ent for both the males and females as well 
as for the combined group. Only the lower 
quarter male group did not show a strong 
preference for “self-expression.” The three 
categories selected least often were “power,” 
“leadership,” and “esteem.” 

Table 1 summarizes the results for the
males. Chi square was significant in two of
the comparisons. On category F (“A job
where you could express your feelings, ideas,
talent, or skill”), chi square was significant
beyond the 1 per cent level of confidence.
Selection of this value is positively related to
high vocational aspiration level. On item J
(“A job where you could work more or less on
your own”), chi square was significant at the 5
per cent level of confidence. Here the males
falling in the bottom quarter on vocational
aspiration tended to select the value of job
“independence” more often than the group in
the upper quarter.


Table 1

Chi Square of Upper and Lower Quarters on Level of
Interest Section and Job Values and
Desires for Males

                             Upper          Lower
                             Quarter        Quarter
                             (N = 148)      (N = 64)
Category                     %              %

A. Leadership                [illegible]    2
B. Interesting Experience    18             27
C. Esteem                    2              3
D. Power                     4              5
E. Security                  12             12
F. Self-Expression           29             [illegible]**
G. Profit                    11             8
H. Fame                      4              6
I. Social Service            7              9
J. Independence              8              19*

* Significant at 5 per cent level.
** Significant at 1 per cent level.


Table 2

Chi Square of Upper and Lower Quarters on Level of
Interest Section and Job Values and
Desires of Females

                             Lower          Upper
                             Quarter        Quarter
                             (N = 105)      (N = 137)
Category                     %              %

A. Leadership
B. Interesting Experience
C. Esteem
D. Power
E. Security
F. Self-Expression
G. Profit
H. Fame
I. Social Service
J. Independence

[Percentage entries are not legible in the source.]

Note: No differences between upper and lower
quarters were significant.

Table 2 summarizes the findings for the 
females. It is apparent from the results that 
scores on the Level of Interest section had no 
significant relationship to the selection of job 
values and desires. 

Table 3 presents the results for the com- 
bined group of males and females. Two of 
the comparisons were statistically significant. 
Category A (“A job where you could be a
leader”) was preferred by more of the subjects
in the upper quarter on the aspiration measure
than would be expected. By the same
token, those falling in the lower quarter 
tended to underselect this particular value. 
“Leadership,” it will be recalled, showed no 
relationship to vocational aspiration score 
when considered for males and females
separately, and was one of the categories least
selected by all groups.

Category F (“A job where you could express
your feelings, ideas, talent, or skill”)
was significantly overselected by the group 
scoring in the upper quarter on the Level 
section while this same job value was of little 
concern to those scoring in the bottom quar- 
ter on the aspiration measure. It will be re- 
called that “self-expression” was significantly 
related to score on Level of Interest for the 
males also, although not for the females. 

Summarizing the findings then, males who 
demonstrate high level of vocational aspira- 
tion are relatively more concerned with job 
values and desires that involve “self-expression.”
On the other hand, males who demonstrate
low vocational aspiration are relatively
more concerned with the job value of “inde- 
pendence.” For adolescent females there ap- 
pears to be no significant relationship be- 
tween aspiration level and job values. For 
the combined group of males and females, 
desires for “leadership” and “self-expression” 
are positively related to high vocational
aspiration.

The negative findings for females may mean
that adolescent girls select job values from
very personal motives unrelated to aspirations
for social status. Since the eventual
socio-economic status of a girl is more likely
to be determined by marriage than by her
occupation, it is possible that her strivings
for social status are not reflected in
vocational values and desires.


Table 3

Chi Square of Upper and Lower Quarters on Level of
Interest Section and Job Values and Desires
for Combined Group

                             Lower          Upper
                             Quarter        Quarter
                             (N = 169)      (N = 285)
Category                     %              %

A. Leadership
B. Interesting Experience
C. Esteem
D. Power
E. Security
F. Self-Expression
G. Profit
H. Fame
I. Social Service
J. Independence

[Legible Lower Quarter entries, in the order they
appear in the source: 1, 30, 4, 2, 12. The remaining
entries are not legible.]

* Significant at 5 per cent level.
** Significant at 1 per cent level.

In a review of the findings, certain similari- 
ties to Centers’ results become apparent. It 
must be kept in mind in making this com- 
parison that the present study examined the 
relation of expressed job values to a Level of 
Interest scale whose connection with ultimate 
occupational status and socio-economic status 
has not been established, while Centers 
studied the relation of expressed job values 
to known socio-economic status (middle class 
or working class status). The present study 
found a preference for “self-expression” (in 
males and in combined sex sample) and for 
“leadership” (in combined sex sample) to be 
related to a high level of vocational aspira- 
tion; Centers found a preference for “self- 
expression” and to some extent, “leadership” 
to be related to membership in the middle 
class. The present study found preference 
for “independence” to be related to a low 
level of vocational aspiration; Centers noted 
a tendency for “leadership” preference to be 
related to membership in the working class. 
The findings of the two studies, when
compared in this manner, suggest the need to
examine more closely the relationship between
level of vocational interest in adolescents and 
socio-economic status. It is possible that the 
adolescent with a high level of vocational 
aspiration identifies himself with the middle 
class and hence views job values in the man- 
ner of adult middle class members while the 
adolescent with a low level of vocational as- 
piration may identify himself with the work- 
ing class. 


Summary 


This study has examined the relationship 
between: (1) level of aspiration, as measured 
by the Level of Interest section of the Occu- 
pational Interest Inventory; and (2) job 
values and desires, as measured by the check- 
list developed by Centers. For the male 
group it was demonstrated that a relation- 
ship exists between these two variables for 
some job values and that this relationship 
seems to be in line with that noted by Centers 
when he examined the role of socio-economic 
differences in the selection of job values and 
desires. Such a finding gives some tentative 
and indirect support to the belief that scores 
on the Level of Interest section may indicate 
the socio-economic status with which the ado- 
lescent male identifies himself. 


Received December 3, 1953. 
Permanence of Strong Vocational Interest Blank Scores¹


Kalmer E. Stordahl 


University of Minnesota²


In assisting young men and women to make 
appropriate vocational and educational choices 
counselors make extensive use of tested inter- 
ests. One of the problems with which both 
the counselor and counselee are concerned is 
the permanence or stability of the scores on 
the measuring instrument. 

The present study was designed to give an 
estimate of the permanence of the scores of 
pre-college males on Strong’s Vocational In- 
terest Blank. For an excellent review of the 
literature up to 1943 on the permanence of 
Strong scores see Strong (7). Two recent 
studies of the permanence of scores over a 
number of years have also been published by 
Strong (5, 6). 


Method 


In the spring of 1949, the Vocational Interest 
Blank was offered on an optional basis to all 
high school seniors who participated in the state- 
wide testing program in the state of Minnesota. 
Approximately 3,500 senior boys completed the 
blank. 

A check was made of the University enroll- 
ment in the spring of 1951 and it was found that 
there were 331 boys enrolled who had completed 
the blank in 1949. To determine whether or not 
the interests of those boys who moved from a 
predominantly rural environment to a metro- 
politan one had changed more than the interests 
of those boys who remained in a metropolitan 
environment, the 331 boys were divided into two 
groups. Those who graduated from high schools 
in Minneapolis, St. Paul, and their immediate 
suburbs were designated as a “metropolitan” 
group (N=250). The second group, all of 
whom had graduated from high schools in cities 
of less than 20,000, were designated as a “non- 
metropolitan” group (N = 81). 

A random sample of 125 boys was chosen from 
the metropolitan group. These boys plus the 81
boys in the non-metropolitan group were contacted in the
spring of 1951 and asked to complete the Strong 
blank; 182, 88 per cent, complied. One blank
was unusable so that the scores of 181 boys
were used in the study.


1 This paper is based upon a portion of a Ph.D.
thesis submitted to the graduate faculty of the
University of Minnesota. The author wishes to
acknowledge the guidance of his advisor, Dr. Willis E.
Dugan.

2 Now at Arkansas Polytechnic College.

The minimum time between test and retest 
was two years and the maximum time did not 
exceed 2.5 years. The mean ages, at the time of 
the retest, of the metropolitan and non-metro- 
politan groups were 19.7 and 19.9 respectively. 
This difference was not statistically significant 
(P > .05).

The median high school percentile rank of the 
metropolitan boys was 74.1 and that of the non- 
metropolitan boys was 79.9. The mean ACE 
Psychological Examination score of the metro- 
politan group was 119.69 and the mean for the 
non-metropolitan group was 122.06. These
differences were not statistically significant (P > .05).

The tests and retests for the 181 subjects were 
scored on 44 occupational keys and for Interest 
Maturity, Occupational Level, and Masculinity- 
Femininity. Also, using Darley’s criteria (1), 
judgments of patterns were made for the eleven 
occupational interest groups. All judgments were 
made independently by two persons and in those 
cases where disagreement was found a third per- 
son made a third independent judgment. When 
more than two judges were needed the pattern 
was designated as that on which two of the three 
judges were in agreement. Thus, each of the 
eleven interest groups for each subject was scored 
as being a primary, secondary, tertiary, or “no” 
pattern. The third judge was needed for ap- 
proximately five per cent of the judgments made. 


Results 


Permanence of Mean Scores. One way of 
measuring the permanence of scores is to de- 
termine the stability of means between ad- 
ministrations. This was done separately for 
the 111 metropolitan and 70 non-metro- 
politan boys. The means and variances of 
the standard scores for 44 occupational and 
3 non-occupational scales are given in 
Table 1.³


3 Table 1 has been deposited with the American
Documentation Institute. Order Document No. 4239
from the ADI Auxiliary Publications Project,
Photoduplication Service, Library of Congress, Washington
25, D. C., remitting in advance $1.25 for 35 mm.
microfilm or $1.25 for 6 X 8 in. photocopies. Make
checks payable to Chief, Photoduplication Service,
Library of Congress.

The significance of the difference between
the test and retest means for each key was
tested by means of the t test, taking the
correlation into account. Since the F test 
showed that the variances were significantly 
different in some cases, the assumption of ho- 
mogeneity of variances did not hold in all 
cases; this is indicated in Table 1. 
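
The t test for correlated means can be sketched as below. This is a minimal illustration under stated assumptions, not the computation used in the study: the function names and the four pairs of scores are hypothetical. The standard error of the difference is written with the test-retest correlation so that the part played by "taking the correlation into account" is visible.

```python
import math
from statistics import mean, stdev


def pearson_r(x, y):
    """Product moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlated_t(test, retest):
    """t for the difference between two means from the same subjects."""
    n = len(test)
    s1, s2 = stdev(test), stdev(retest)
    r = pearson_r(test, retest)
    # Variance of the difference scores, expressed with the correlation term.
    var_diff = s1 ** 2 + s2 ** 2 - 2 * r * s1 * s2
    se = math.sqrt(var_diff / n)
    return (mean(retest) - mean(test)) / se  # refer to t with n - 1 df


# Hypothetical standard scores for four subjects on one key.
test_scores = [50, 52, 47, 60]
retest_scores = [53, 55, 49, 64]
t_value = correlated_t(test_scores, retest_scores)
print(round(t_value, 2))
```

The higher the test-retest correlation, the smaller the standard error, so a given change in means is more readily detected than it would be with two independent samples.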

There was a significant difference between 
the means of the two administrations at the 
.01 level on 24 scales for the non-metropoli- 
tan group and on 26 scales for the metropoli- 
tan group. Twenty-two of these scales were 
common to the two groups. The significant 
changes in means between the test and retest, 
as shown in Table 1, were in both the posi- 
tive and the negative direction. The direc- 
tion, in all instances where there was a sig- 
nificant difference between administrations at 
the .01 level, was the same for the metropoli- 
tan and non-metropolitan groups. 

The interest group which showed the larg- 
est and most consistent changes was Group V. 
All the scales within this group showed a sig- 
nificant increase in mean score for both the 
metropolitan and non-metropolitan boys. Of 
the non-occupational scales, Interest Maturity 
was the only one which changed significantly. 
As would be expected, the mean on this scale 
increased for both the metropolitan and non- 
metropolitan boys. 

To determine whether or not there was a 
difference between the metropolitan and non- 
metropolitan groups with respect to stability 
of mean scores the means of the difference 
scores (test minus retest) were compared for 
each scale. When the variances were found 
to be homogeneous by means of the F test, 
the t test was used. When the variances 
were not homogeneous, the approximate 
method proposed by Cochran and Cox to 
test the hypothesis of equality of means with 
no hypothesis about the population variance 
was used (2). None of the differences be- 
tween the means of the difference scores of 
the two groups were significant at the .01 
level; three (Aviator, Vocational Agriculture 
Teacher, and Sales Manager) were found to 
be significantly different at the .05 level. 

Test-Retest Correlation. The test-retest 
scores for each of the scales were plotted and 
in all cases the relationship between test and 
retest appeared, by inspection, to be linear. 
Table 2

Correlations Between Two Administrations of the
Vocational Interest Blank for 111 Metropolitan
and 70 Non-Metropolitan Males

Interest                                Metropolitan   Non-Metropolitan
Group     Scale                         Males          Males

I         Artist                        .72            .71
          Psychologist                  .77            .67
          Architect                     .75            .68
          Physician                     .75            .64
          Osteopath                     .71            .63
          Dentist                       .71            .60
          Veterinarian                  .7[?]          .61
II        Mathematician                 .59            .66
          Physicist                     .78            .69
          Engineer                      .78            .79
          Chemist                       .79            .71
III       Production Manager            .62            .76
IV        Farmer                        .85            .76
          Aviator                       .18[?]         .81
          Carpenter                     .80            [?]
          Printer                       .61            .60
          Math. Phys. Science Tchr.     [?]            .61
          Industrial Arts Tchr.         .79            .74
          Voc. Agri. Tchr.              .81            .68
          Policeman                     .69            .60
          Forest Service Man            .80            .66*
V         YMCA Physical Director        .72            .66
          Personnel Director            .60            .49
          Public Administrator          .62            .45[?]
          YMCA Secretary                .62            .60
          Soc. Science H. S. Tchr.      .68            .69
          City School Supt.             .70            .69
          Minister                      .67            .70
VI        Musician                      .68            .76
VII       CPA                           .63            .60
VIII      Senior CPA                    .65            .62
          Accountant                    .72            .66
          Office Man                    .66            .67
          Purchasing Agent              .42            .17[?]
          Banker                        .75            .64
          Mortician                     .74            .74
          Pharmacist                    .54            .59
IX        Sales Manager                 .17[?]         .62
          Real Estate Salesman          [?]            .60
          Life Insurance Salesman       .79            .69
X         Advertising Man               .75            .68
          Lawyer                        .73            .73
          Author-Journalist             .70            .74
XI        President Mfg. Concern        .67            .60
          Interest Maturity             .66            .61
          Occupational Level            .70            [?]
          Masculinity-Femininity        .76            .87

* Difference significant at the .05 level.
[?] Entry doubtful or illegible in the source.










Product moment correlation coefficients were 
then computed between the test and retest 
for each of the scales. These were computed 
separately for the metropolitan and non- 
metropolitan group. The correlations are 
given in Table 2. 

None of the observed differences between 
the test-retest correlations for the metropoli- 
tan and non-metropolitan groups were found 
to be significant at the .01 level. Only one, 
that for Forest Service Man, was found to 
be significant at the .05 level. 
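
The paper does not say how the two correlations were compared. One common approach, shown here purely as an assumption, is Fisher's z transformation for correlations from two independent groups. The function name is hypothetical; the figures are the Forest Service Man values from Table 2 (.80 for 111 metropolitan boys, .66 for 70 non-metropolitan boys).

```python
import math


def compare_correlations(r1, n1, r2, n2):
    """Normal-deviate test for the difference between two independent r's."""
    z1, z2 = math.atanh(r1), math.atanh(r2)  # Fisher's z transformation
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal probability
    return z, p


z, p = compare_correlations(0.80, 111, 0.66, 70)
print(round(z, 2), round(p, 3))
```

Under this assumed method the deviate falls close to the conventional two-sided .05 boundary of 1.96, which agrees in outline with the result reported in the text.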

Permanence of Letter Grade Scores. To 
get a measure of permanence in terms of let- 
ter grade scores, the change in letter grades 
between test and retest was determined. For 
each Strong scale a tabulation was made of 
the letter grade obtained on the retest for 
each letter grade received on the test. Since 
the comparisons of mean scores and of cor- 
relations for the metropolitan and non-metro- 
politan groups indicated that the two groups 
were similar with respect to permanence of 
scores, the metropolitan-nonmetropolitan clas- 
sification was not retained for this part of 
the study. The two groups were pooled and 
treated as a single sample. 

Table 3 gives the amount of change in let- 
ter grades between test and retest when all 
occupational scales are summed. Because of 
space limitations a breakdown by individual 
keys is not included here. For such a de- 
tailed breakdown see Stordahl (4). 

A chi-square test for independence of letter 
grade and permanence was made by summing 
the letter grades over all scales and classify- 
ing the change in letter grades between test 
and retest into two categories—‘identical” 
and “not identical.” The hypothesis of inde- 
pendence of permanence and letter grade was 
rejected (P < .001). Table 3 indicates that 
on the average, C grades were the most stable, 
68 per cent of the C grades on the first test 
being C grades on the second test. The sec- 
ond most stable letter grade was A, with 60 
per cent of the letter grades being identical 
on the test and retest. The intermediate let- 
ter grades were less stable. By combining 
the letter grades so that B included B +, B, 
and B —, and C included C + and C, it was 
found that 73 per cent of the C grades on the 


Table 3 


Change in Letter Grade Scores on the Vocational 
Interest Blank for 181 Boys Tested as High 
School Seniors and Retested Two Years 
Later as College Students 


Test Retest 


Letter y 4 oT 
Grade N 5 B B+ 


A 804 ; 19 60 
B+ 761 3 21 26 30 
B 1,106 8 23 20 «17 
B— 1,394 12.9 
C+ 1,300 : k ss 
¢ a as 


test remained C grades on the retest and that 
59 per cent of the A and 59 per cent of the 
B grades remained constant. 

Permanence of Interest Patterns. The 
permanence of interest patterns over the two 
year period is summarized in Table 4. Here, 
as for the letter grades, the data are pre- 
sented for the metropolitan and non-metro- 
politan groups combined and for all interest 
groups combined. 

Chi-square was used to test the independ- 
ence of interest pattern and permanence by 
summing the patterns over all groups and 
classifying the change between tests as 
“identical” and “not identical.” The
hypothesis of independence was rejected
(P < .001).

As can be seen from Table 4, the primary
and “no pattern” patterns were found to be
the most stable, with 58 per cent of the
primary patterns on the first test being
primary patterns on the retest and 81 per cent
of the “no patterns” being identical on the
retest. The secondary and tertiary patterns
were less stable.


Table 4

Change in Interest Patterns on the Vocational Interest
Blank for 181 Boys Tested as High School
Seniors and Retested Two Years
Later as College Students

Pattern               Legible retest percentages
on Test     N         (columns %N, %T, %S, %P;        Total
                      positions uncertain)

P           229       12, 17, 5[?]                    100
S           210       20, 28                          100
T           274       19, 16                          100
N           1,278     9, 6                            100

Permanence of Individual Profiles. The 
stability of individual profiles was deter- 
mined for each of the 181 boys. Kendall’s 
(3) coefficient of concordance, W, was used 
as a measure of stability. The coefficient, W, 
is based on the method of ranks. It is re- 
lated to Spearman’s rho. The 44 occupa- 
tional scales were used in computing this co- 
efficient; the non-occupational scales were not 
included. When there were ties in rank, the 
median rank was assigned to each of the tied 
scales. 

The median coefficient of concordance for 
the metropolitan group was .87 and the 
median for the non-metropolitan group was 
.86. Since W has a direct relationship to 
Spearman’s rho, these figures can also be ex- 
pressed in terms of rho; the median rhos be- 
ing .74 and .72. All but nine of the coeffi- 
cients for the metropolitan group and six for 
the non-metropolitan group were found to be 
significantly greater than zero. 
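
The relationship between W and rho used above can be checked with a short sketch. For m rankings, the average Spearman rho between pairs of rankings is (mW - 1)/(m - 1), so with the two rankings of a test and a retest rho = 2W - 1. The function names and the five-scale example are hypothetical, ties are given their middle rank as described in the text, and no tie correction is applied to W.

```python
def midranks(scores):
    """Rank scores from 1 (lowest); tied scores share their middle rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # middle of sorted positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def kendall_w(*score_sets):
    """Coefficient of concordance for m rankings of the same n scales."""
    m, n = len(score_sets), len(score_sets[0])
    rankings = [midranks(s) for s in score_sets]
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((t - mean_sum) ** 2 for t in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


def mean_rho_from_w(w, m=2):
    """Average Spearman rho implied by W for m rankings."""
    return (m * w - 1) / (m - 1)


# Hypothetical test and retest scores for one boy on five scales.
w = kendall_w([10, 20, 30, 40, 50], [12, 18, 35, 33, 60])
print(round(w, 2), round(mean_rho_from_w(w), 2))
```

Applied to the medians reported here, 2(.87) - 1 = .74 and 2(.86) - 1 = .72, which matches the rho values given in the text.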

The homogeneity of the frequency distribu- 
tions of coefficients of concordance for the 
metropolitan and non-metropolitan groups 
was tested by the Brandt-Snedecor chi-square 
method (2). The two distributions were 
found to be homogeneous (P > .05). 


Discussion 


In this study a substantial relationship was 
found to exist between the interest scores re- 
ceived as high school seniors and as college 
sophomores. This relationship was, however, 
far from being a perfect one and large indi- 
vidual differences in stability were evident. 

The scores of the metropolitan and non- 
metropolitan boys were quite homogeneous 
with respect to permanence of interest scores. 
The writer hypothesized that if the boys’
interests had not as yet become stabilized,
some difference between these groups might 
be found. Assuming that interests are largely 
determined by one’s experiences, it was 
thought that the change in environment for 
the non-metropolitan boys might cause a
greater change in their scores than would be 
found for the metropolitan boys, whose en- 
vironment remained relatively constant. Such 
a difference was not found. 

No attempt has been made to make a di- 
rect comparison between the results of this 
study and the previous research of Strong 
and others. Such a comparison would be 
difficult since most previous research has been 
based on the original form of the blank 
whereas the revised form was used in the 
present study. However, since Strong (5, 6, 
7) has reported some data on permanence 
with the revised keys and since the original 
keys were very similar to the revised, some 
general comparisons can be made. 

The results of the present study, with re- 
spect to permanence as measured by mean 
scores, correlation, and permanence of indi- 
vidual profiles, are not greatly divergent 
from previous investigations. The results 
also support Strong’s conclusion that we can 
place the greatest confidence in C letter 
grade ratings. Theoretically, as Strong has 
indicated, the A and C ratings should be the 
most stable as they cover a wider range of 
scores than the B rating. The fact that the 
A ratings were found to be no more stable 
than the B can probably be accounted for by 
the relatively small number of A ratings and 
the tendency for them to be low A ratings. 

Although no previous studies have consid- 
ered permanence in terms of interest patterns, 
counselors may find this the best way to look 
at permanence of scores. The evidence indi- 
cates that the counselor can place the most 
confidence in primary patterns and “no pat- 
terns” since these are apparently more stable 
than secondary and tertiary patterns. This is 
especially true when no pattern exists in an 
interest group. 


Summary 


A sample of 181 males, 111 from a large 
metropolitan area and 70 from non-metro- 
politan areas, who had completed Strong’s 
Vocational Interest Blank as high school 
seniors were retested two years later as
college students. The tests and retests for the
181 boys were scored on 44 occupational
keys and for Interest Maturity, Occupational 
Level, and Masculinity-Femininity. The test- 
retest scores on the 47 scales were compared 
in several ways to secure an estimate of the 
stability of scores over the two-year period. 
The following measures of test-retest sta- 
bility were used: permanence of mean scores, 
test-retest correlation, permanence of letter 
grade scores, permanence of interest patterns, 
and stability of individual profiles. 

A substantial relationship was found to 
exist between the interest scores received as 
high school seniors and as college sophomores. 
The metropolitan and non-metropolitan boys 
were quite homogeneous with respect to 
permanence of Strong Scores. 


Received December 30, 1953. 
Although there is evidence that scores on
the Kuder Preference Record differentiate 
among occupational groups (e.g., 2, 3, 5), 
few investigators have attempted to deter- 
mine the relationship between such scores 
and the occupation entered at a later date. 
Barnette (1) presents data indicating that 
Kuder scores are related to occupational satis- 
faction several years after counseling, but his 
subjects were young adults at the time of ad- 
visement. Since many counselors deal with 
adolescent youth, it is of some interest to find 
out whether Kuder scores obtained during the 
late adolescent period are related to actual 
occupations entered subsequently. 

The present paper reports a follow-up study 
of boys counseled at the Cleveland Jewish 
Vocational Service during the years 1943, 
1944, and 1945. This study attempted to 
determine whether a relationship existed be- 
tween Kuder Preference Record (Form B) 
scores at the time of counseling and the oc- 
cupation engaged in at the time of the study 
(1952). 


Subjects 


Questionnaires designed to elicit information 
about current occupation were sent to 215 men 
who had been counseled and tested at JVS dur- 
ing the years 1943-1945. The mailing list in- 
cluded all eleventh and twelfth-grade males tested 
during that period and all tenth-grade males 
tested in 1943 and 1944. The original letter, a 
reminder postcard, and a follow-up letter yielded 
questionnaire returns from 58 per cent of the 
mailing list. Of the total group, six per cent 
were known to be in military service, seven per 
cent had moved to unknown addresses, and one 
per cent was deceased. No information of any 
kind was returned for 28 per cent of the group. 


1 This paper is based on a portion of a thesis sub- 
mitted in partial fulfillment of the requirements for 
the M.A. degree at Western Reserve University by 
the first-named author and supervised by the second 
author. 


In order to determine the existence of bias in 
the sample that returned questionnaires, several
comparisons were made between the respondent 
and non-respondent groups. The mean age of 
the 124 respondents at time of original testing 
was 16 years and 7 months and that of the 64 
non-respondents was 16 years 10 months, 3 
months higher. The difference is statistically
insignificant (t = .39). Scores on the American
Council Examination (High School Form, 1942) 
were available for 44 of the respondents and for 
23 of the non-respondents. The means of the
two groups were not significantly different for L, 
Q, or Total scores. (For Total scores, t = .88.) 
When mean scores on the separate scales of the 
Kuder were compared, only one difference was 
significant at the five per cent level. That dif- 
ference occurred between mean scores on the 
musical scale, and the respondents were signifi- 
cantly lower in measured musical interest. These 
comparisons suggest that the respondent group is 
a fairly unbiased sample of the total group to 
whom questionnaires were mailed. 

Further data on the respondent group were ob- 
tained from the questionnaires. In terms of edu- 
cational achievement, it is clear that our respond- 
ents are not representative of the general male 
population and are probably not representative 
of males seen at the agency. Of the respondents, 
79 per cent indicated that they had completed at 
least four years of college. Less than two per 
cent had not finished high school. This substan- 
tial educational attainment is probably due in 
part to the financial assistance given veterans 
during the period covered by the study, but it 
may also reflect the family educational aspira- 
tions held by the clients of this agency. 


Procedure 


The questionnaire consisted of three items ask- 
ing for identifying data, one item on amount of 
education, two items requesting data about pres- 
ent occupation and length of time it had been en- 
gaged in, four items dealing with job satisfaction, 
and one item about the estimated influence of 
JVS counseling on occupational choice. The 
data on job satisfaction are not analyzed in the 
present report, since few respondents indicated 
dissatisfaction with their current occupation. 

The first step in the treatment of the data was
to classify the reported occupations of the re- 
spondents. Using the Kuder manual (3) as a 
guide, the occupations were coded as belonging 
to one or more of the nine Kuder interest areas. 
For example, mechanical engineering was classi- 
fied as belonging to the mechanical, computa- 
tional, and scientific interest groups. In most 
cases the reported occupation was classified ac- 
cording to its listing in the Kuder manual. A 
subjective judgment had to be made in a small 
number of cases: T-V producer was classed as 
persuasive, artistic, literary, and musical; gradu- 
ate student in international relations was classed 
as persuasive, literary, and social service; busi- 
ness men, executives, and those who were self- 
employed were included in the persuasive inter- 
est group. The occupational interest classifica- 
tion was entered on three by five cards con- 
taining other data about the subjects, so that 
tabulation could be done directly from the cards. 

Seven cases were eliminated from the respond- 
ent group at this point. They were either un- 
employed or were in undergraduate college. The 
final group of respondents used in this study, 
then, numbered 117. 

For each Kuder scale, the total group of re- 
spondents was divided into two sub-groups: those 
in occupations belonging to that interest area 
and all others. Mean Kuder raw scores and 
standard deviations were computed for each of 
these sub-groups, and a t-test was applied to the
differences between the means. 
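
The comparison described above can be sketched in a few lines. This is only an illustration, assuming the conventional pooled-variance form of Student's t for two independent groups; the paper does not state which form was used. The function name `pooled_t` and the summary figures in the example are hypothetical and are not taken from Table 1.

```python
import math

def pooled_t(n1, mean1, sd1, n2, mean2, sd2):
    """Student's t for two independent groups, pooling the two variances.

    Returns the t ratio and its degrees of freedom.
    """
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    std_err = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean1 - mean2) / std_err, df

# Hypothetical summary statistics, used only to exercise the function.
t_value, degrees = pooled_t(10, 5.0, 2.0, 10, 3.0, 2.0)
print(round(t_value, 3), degrees)
```

For each Kuder scale, the "occupied" sub-group would supply the first three arguments and the "others" sub-group the last three.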


Results 


Table 1 summarizes the results of the sta- 
tistical analysis. Taking the mechanical in- 
terest scale as an example, Table 1 reads as 
follows: Of the entire respondent group, 26 
were currently occupied in jobs involving me- 
chanical interest, and 91 were in jobs that did 
not require this interest. The mean mechani- 
cal interest score earned by the mechanically 
occupied group, seven to nine years earlier, 
was 81.6. The mean score of those now in 
other kinds of jobs was 69.5. The difference 
between the mean scores of the two groups is 
significant at the five per cent level of confidence
as shown by the t value of 2.46.

Only mean scores are presented for the ar- 
tistic and musical scales, since few subjects 
reported occupations involving these interests. 
On both of these scales, however, the differ- 
ences are in a direction consistent with those 
found for the other scales. 

For six of the remaining seven scales, the
data show that men currently engaged in occupations
involving those interests made significantly
higher mean scores than did men
in other occupations.


Table 1

Comparisons of Kuder Preference Record Scores Made
Seven to Nine Years Ago by Men Engaged in
Occupations Related to an Interest Area
and by Those Engaged in Other
Occupations

Kuder Scale           N       M        t

Mechanical
  Occupied           26     81.6     2.46*
  Others             91     69.5
Computational
  Occupied
  Others
Scientific
  Occupied
  Others
Persuasive
  Occupied
  Others
Artistic
  Occupied
  Others
Literary
  Occupied
  Others
Musical
  Occupied
  Others
Social Service
  Occupied           10
  Others            107
Clerical
  Occupied           24
  Others             93

[Other entries are illegible in the source; two further mean scores, 59.0 and 49.7, survive without a legible row.]

* Significant at the 5 per cent level of confidence.
** Significant at the 1 per cent level of confidence.

The failure of the men engaged in social 
service occupations to show a significantly 
higher social service score than those in other 
occupations probably reflects an inadequacy 
of our sample rather than an inadequacy of 
the scale. Eight of the ten cases classed as 
engaged in a social service occupation were 
students in professional school. Only one of 
these was in a graduate school of social work. 
Three others were engaged in the graduate
study of liberal arts subjects, three were in 
medical school, and one was in dental school. 
The composition of this sub-group, then, does 
not provide a satisfactory sample of persons 
in this occupational interest area. 

These results provide evidence that inter- 
est scores on the Kuder Preference Record 
are positively related to occupations entered 
seven to nine years later. Further, they indi- 
cate that interests have been sufficiently or- 
ganized by the time the last few years of high 
school are reached to provide one basis for 
estimating future occupational activity. 


Discussion 


Several considerations should be kept in 
mind in interpreting the results of this study. 
In the first place the interest test was ad- 
ministered as part of a total counseling serv- 
ice. Aptitude and achievement tests were 
used along with interest tests to provide a 
basis for personal interviews. It could be 
argued that the decisions arrived at during 
the entire counseling process largely deter- 
mined the occupation entered seven to nine 
years later. If the Kuder scores influenced 
counseling decisions, then the relationship 
found in this study could be due, not to the 
persistence of adolescent interests, but to the 
persisting effects of counseling based on ado- 
lescent interests. While our data cannot set- 
tle this issue definitively, several facts argue 
against the belief that counseling decisions 
alone can account for the relationship be- 
tween measured interests and occupational 
entry. For one thing, Strong (4) has pre- 
sented findings that show persistence of inter- 
ests over a long period of time. His original 
test data were apparently not collected in a 
counseling situation, so that the persistence 
of interests he found could not be attributed 
to the counseling process. For another thing, 
our respondents themselves did not attribute 
a great deal of influence to the decisions ar- 
rived at in the counseling process. When 
asked whether the suggestions made by the 
JVS influenced their occupational plans, only 
35 per cent said “yes,” 44 per cent said “no” 
and 20 per cent could not recall any influence.
Although the recall of counseling influence
is not the sole valid measure of the
existence of influence, our data certainly do 
not support the view that it determines occu- 
pational entry to a greater extent than the 
persistence of interests. 

A further consideration in interpreting our 
findings concerns the effect of military serv- 
ice and government assistance to veterans in 
school. Perhaps military service had the ef- 
fect of disrupting the normal peacetime paths 
to occupational entry. Our results would 
then show the minimal relationship between 
adolescent interest and later occupation. On 
the other hand, our respondents may have 
been enabled to enter preferred occupations 
to a greater extent than is usually true, be- 
cause the veterans’ benefits helped them to 
continue their education. The most that can 
be said on this point is that our findings need 
to be supported by data collected during a 
period free from the special influences cre- 
ated by wartime mobilization and a postwar 
economy. 


Summary and Conclusions 


In order to discover whether a significant 
relationship existed between adolescent inter- 
ests and later occupational choice, a question- 
naire was mailed to 215 men who had been 
counseled seven to nine years earlier during 
the latter portion of their high school careers. 
Usable information on current occupation was 
obtained from 117 of those on the mailing 
list. Comparisons of the respondents with 
the non-respondents indicated no difference 
with respect to age at time of counseling, in- 
telligence, and mean scores on eight of the 
nine Kuder Preference Record scales. Re- 
ported occupations were classified in accord- 
ance with the interests they involved as pre- 
sented in the Kuder manual. 

For six of the Kuder interest areas, men 
currently engaged in a related occupation 
made significantly higher scores seven to nine 
years ago than did men engaged in unrelated 
occupations. The three remaining interest 
areas (artistic, musical, and social service) 
did not yield clear-cut results because of the 
inadequacies of the sample. 
We conclude that interests measured by
the Kuder Preference Record in adolescence

2. Hahn, M. E. and Williams, Cornelia T. The
measured interests of Marine Corps Women


The Degree to Which Colors (Hues) Are Associated with
Mood-Tones ¹


Lois B. Wexner 
Division of Education and Applied Psychology, Purdue University 


The literature is replete with statements
concerning the relation of color and emotional
states or feeling-tones, but there is a dearth of
experimental investigation to support these
statements. In a recent study by Odbert,
Karwoski, and Eckerson (8), regarding the 
associations of color and mood, it was found 
that some colors were more often chosen to 
go with certain groups of words describing 
mood, such as red with exciting, orange with 
gay, yellow with playful, green with leisurely, 
blue with tender, purple with solemn, and 
black with sad. Two shortcomings of this 
study, however, are first, that the groups of 
words (represented above by exciting, gay, 
etc.) included words which could in no way 
be considered to mean the same thing, such as 
the “playful” list, which included humorous, 
whimsical, fanciful, quaint, sprightly, delicate, 
light, and graceful. Thus one subject may be 
reacting to one particular word in the list, and 
another, to an entirely different one. And, 
second, a partially “forced” method was used 
to fit the moods to a color-circle (arranged 
according to wave-length). For instance, gay 
is reported to “go with” orange, but in reality 
orange was chosen only 16 times, whereas 
red was chosen 62 and yellow 27 times. Fur- 
ther, the authors’ judgment appears to be the 
only method used to choose which colors went 
with which moods. Thus, although the nu- 
merical results are published, a clear-cut sta- 
tistical interpretation is lacking. Other studies 
(1, 2, 3, 9, 10, 13) report the association of 
color and moods, as determined by various 
methods including objective impressions, clini- 
cal observation, and introspection. 


Purpose 

The purpose of this investigation is to de- 
termine to what degree colors (hues) are as- 
sociated with mood-tones. The hypothesis to 
be tested is that there is a positive relation 
between certain colors and mood-tones. 

1 Grateful acknowledgment is made to James A.
Norton, Jr., for his helpful suggestions in the use of
statistical techniques, and to Joyce Block, Malcolm
Robertson, and Henry Wexner, who served as judges.


Procedure


The mood-tones used in this experiment were 
arbitrarily selected as a fairly representative 
group. Originally, twelve words were chosen, 
i.e., exciting, secure, distressed, tender, protective,
despondent, calm, dignified, cheerful, defiant,
powerful, and sensuous. Then a list of 164 
adjectives was prepared, including moods re- 
ported in the literature, synonyms of those words 
as well as those listed above, and other words the 
writer believed might be useful. The original 
twelve words were presented to four judges, with 
the list of adjectives. The judges (two of whom 
were male and two female) were requested to 
choose words from the list of adjectives which 
they felt meant the same as the “mood-tone” 
words. They were allowed to use words more 
than once if they wished. Then, the mood-tone 
words were listed together with their synonyms 
as unanimously chosen by the four judges. Since 
the judges did not agree on the meaning of 
sensuous, this word was not included in the ex- 
periment. The final groups of mood-tones are as 
follows: exciting, stimulating; secure, comfort- 
able; distressed, disturbed, upset; tender, sooth- 
ing; protective, defending; despondent, dejected, 
unhappy, melancholy; calm, peaceful, serene; 
dignified, stately; cheerful, jovial, joyful; defiant, 
contrary, hostile; and powerful, strong, masterful. 

The subjects consisted of 94 students in a 
course of beginning General Psychology, of which 
48 were female and 46 were male. The subjects, 
in three groups, were presented with an instruc- 
tion sheet containing the word groups as above 
and the following directions: 


The following groups of words are meant to 
represent feelings, or mood-tones. It is thought 
that certain colors tend to “go with” various 
mood-tones, and this is an attempt to deter- 
mine to what extent this may be true. Please 
select the one color, of the colors on the charts, 
that you feel best represents the feelings de- 
scribed by the following word groups. All the 
colors need not be used, and colors may be 
used more than once. Be sure to select a 
color for each group, even though it may seem 
difficult to find a color to fit the mood-tone. 
Usually your first impression would be the 
best one, if in doubt. 


Eight colors, yellow, orange, red, purple, brown, 
blue, black, and green, in the form of 8½ X 11
inch pieces of art paper mounted on 30 X 40 
inch pieces of light-gray cardboard, were ran- 
domly arranged at the front of the room. It 
should be noted that no mention of color names 
was made by the experimenter. This was in 
order to avoid associations to color stereotypes 
and to assure that the colors as chosen were the
particular shades presented to the subjects, in an 
attempt to insure uniformity of shade. It might 
further be noted that there was no difficulty in 
determining which colors the subjects intended 
to indicate. 

Chi-square tests for any possible sex differences
in color association to mood-tones were
made. No significant differences were found to 
exist. 

Since there were no significant sex differences, 
the frequencies from the two sexes were com- 
bined into one set for further study. Then, for 
each mood-tone, a chi-square test was made to 
test whether or not the colors differed significantly
in frequency of association with that
mood-tone. These chi-squares were significant in 
all cases. (A five per cent significance level was 
used.) Thus it is demonstrated that some colors 
are more often associated with a given mood- 
tone than others. 
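
The per-mood-tone test described above can be sketched as a chi-square goodness-of-fit computation. This is a minimal sketch, assuming the null hypothesis is that all eight colors are chosen equally often (the paper does not spell out the expected frequencies). The function name `chi_square_uniform` and the counts are hypothetical; the counts merely sum to the 94 subjects. The critical value 14.067 is the tabled five per cent point of chi-square for 7 degrees of freedom.

```python
def chi_square_uniform(counts):
    """Chi-square statistic against equal expected frequencies.

    Returns the statistic and its degrees of freedom (categories - 1).
    """
    total = sum(counts)
    expected = total / len(counts)
    stat = sum((obs - expected) ** 2 / expected for obs in counts)
    return stat, len(counts) - 1

# Tabled 5 per cent critical value of chi-square for 7 degrees of freedom.
CRITICAL_5PCT_DF7 = 14.067

# Hypothetical choices of eight colors by 94 subjects for one mood-tone.
counts = [40, 20, 14, 8, 6, 4, 1, 1]
stat, df = chi_square_uniform(counts)
significant = stat > CRITICAL_5PCT_DF7
print(round(stat, 3), df, significant)
```

A significant result says only that the colors differ somewhere in frequency of choice; it does not say which colors differ, which is why the multiple-comparisons step follows.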

The next step was to determine which par- 
ticular colors were most often associated with a 
given mood-tone. For this purpose, Tukey’s 
(14) procedure for accomplishing multiple com- 
parisons among a set of observed means was 
adapted to make multiple comparisons among a 
set of observed frequencies in mutually exclusive 
categories (7). The essential nature of this 
adaptation was to use the inverse sine transfor- 
mation upon the observed proportions. The 
error variances of such transformed proportions 
are given by the theory of the transformation 
(4). In all cases a significance level of five per 
cent was used. 
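
The transformation step can be illustrated as follows. This is a sketch only, assuming the usual angular form, the arcsine of the square root of the proportion in radians, whose sampling variance is approximately 1/(4n) whatever the proportion. The function names and the counts are hypothetical, and the sketch does not reproduce the full multiple-comparisons procedure cited in the text.

```python
import math

def arcsine_transform(count, n):
    """Angular (inverse sine) transform of the proportion count / n, in radians."""
    return math.asin(math.sqrt(count / n))

def transformed_variance(n):
    """Approximate variance of the transformed proportion for n observations."""
    return 1.0 / (4.0 * n)

N_SUBJECTS = 94  # number of subjects reported in the text

# Hypothetical frequencies with which two colors were chosen for one mood-tone.
angle_a = arcsine_transform(47, N_SUBJECTS)
angle_b = arcsine_transform(20, N_SUBJECTS)
variance = transformed_variance(N_SUBJECTS)
print(round(angle_a, 4), round(angle_b, 4), round(variance, 5))
```

Because the variance depends only on n, every transformed proportion in a table of 94 responses carries the same error term, which is what allows a procedure designed for means to be applied to frequencies.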


Results 


The following results were obtained. For 
each mood-tone, the colors are grouped (A, 
B, C, etc.) according to the results of the 
multiple comparisons tests. The interpreta- 
tion of these groups is as follows: colors in 
the same group are associated with the mood-tone
significantly more often than colors in
groups below them, and significantly less often 
than colors in groups above them. Colors in 
the same group do not differ significantly 
from each other in frequency of association 
with the mood-tone. 


Exciting, stimulating.

Group   Color     Frequency
A       Red
B       Yellow
        Orange
C       Green
        Purple
        Black
        Blue
        Brown


Secure, comfortable.

Group   Color     Frequency
A       Blue      41
B       Brown     23
        Green     18
C       Yellow
D       Orange
        Black
        Red
        Purple


Distressed, disturbed, upset.

Group   Color     Frequency
A       Orange    34
B       Black     16
C       Purple
        Brown
        Green
        Red
        Yellow
        Blue


Tender, soothing.

Group   Color     Frequency
A       Blue      41
B       Green     24
C       Yellow    11
        Purple
        Brown
        Orange
        Black
        Red


Protective, defending.

Group   Color     Frequency
A       Red       21
        Brown     17
        Blue      15
        Black     15
        Purple    14
B       Green      5
        Orange
        Yellow


Despondent, dejected, unhappy, melancholy.

Group   Color     Frequency
A       Black
        Brown
B       Purple
        Blue
        Green
        Yellow
        Orange
        Red
Frequency 


[Group labels and some frequencies in the following five tables are illegible in the source.]


Calm, peaceful, serene.

Group   Color     Frequency
A       Blue      38
        Green
B       Yellow
        Purple
        Orange
        Brown
        Black
        Red

[Remaining legible frequencies, rows uncertain: 7, 4, 3, 3, 0]


Dignified, stately.

Color     Frequency
Purple    45
Black     30
Blue       9
Brown      6
Red        3
Orange     1
Yellow
Green


Cheerful, jovial, joyful.

Color     Frequency
Yellow    40
Red       20
Orange    14
Green
Blue
Purple
Brown
Black

[Remaining legible frequencies, rows uncertain: 0, 0]


Defiant, contrary, hostile.

Color     Frequency
Red       23
Orange    21
Black     18
Brown     11
Purple     9
Yellow     5
Green      5
Blue       2


Powerful, strong, masterful.

Color     Frequency
Black     48
Red
Purple
Blue
Brown
Orange
Yellow
Green


Discussion

In general, the results of this investigation tend to
support the color-mood studies as reported in the
literature. It should be noted, however, that the
association of some mood-tones with certain colors is
more clear-cut than others. For instance, in some cases
one color “goes with” a mood-tone significantly more
often than does any other color (of the particular shades
of colors used in this experiment). Red is more often
associated with exciting-stimulating, blue with
secure-comfortable, orange with distressed-disturbed-upset,
blue with tender-soothing, purple with dignified-stately,
yellow with cheerful-jovial-joyful, and black with
powerful-strong-masterful. On the other hand, there is no
statistically significant difference between certain
colors in their association with certain other mood-tones,
such as red, brown, blue, black, and purple with
protective-defending; black and brown with
despondent-dejected-unhappy-melancholy; blue and green
with calm-peaceful-serene; and red, orange, and black
with defiant-contrary-hostile.

Since there appears to be fairly consistent agreement
among the studies on this subject, it is appropriate to
suggest possible contributing factors, although it is not
the purpose of this paper to investigate this particular
aspect of the problem. In addition to the cultural factor
which no doubt plays an important part in the associations
of colors with certain mood-tones, there seems to be the
possibility of the existence of biological determinants.
Guilford (6) states that experimental results “point very
strongly to a basic communality of color preferences among
individuals. This communality probably rests upon
biological factors, since it is hard to see how cultural
factors could produce by conditioning the continuity and
system that undoubtedly exists.” Goldstein (5) is more
explicit in setting forth physiological effects of color
on the human organism, and indicates that patients,
exposed to various colors such as large sheets of colored
paper, change the position of the arms in different
directions, according to the color to which they are
exposed; that color influences the speed of volitional
movements; and that seen and felt distances and time
intervals and weights are judged differently under the
influence of different colors. He finds, furthermore, that
green favors performance in general, in contrast to red,
and feels that these different effects correspond to very
definite, but different, total behavioral attitudes, which
find their expression very clearly in the subject’s
reports of the mood corresponding to the various colors.

In addition to the part played by learning 
in the cultural and biological determinants of 
associations of colors with certain mood- 
tones, there may be an additional factor 
which should be included, in the form of par- 
ticular learning situations, which may affect 
individuals and/or groups. An experiment in 
support of this type of contribution was done 
by Staples and Walton (12). 

The foregoing are merely suggested as pos- 
sible contributing factors to color and mood 
association, and the need for additional ex- 
perimental work in this area is obvious. 

With regard to the present investigation, it 
would seem possible, and even likely, that in 
a similar experiment different results might 
be obtained if different shades of the same 
colors were used. For instance, in a discus- 
sion with a group of the subjects after the 
data had been collected, the writer men- 
tioned that she had expected purple to “go 
with” powerful. One of the subjects replied 
that the particular shade of purple was not 
deep and dark enough to be a “powerful” 
purple. Thus it would appear that extreme 
caution should be used in generalizing these 
findings to other shades of the same colors. 
However, because of the positive findings of 
this experiment, it would appear that useful 
information could be obtained by extending 
this type of investigation to other groups. 
Such information might possibly be of ex- 
tensive value to both industrial and clinical 
psychologists. 


Summary 


In an attempt to determine to what degree 
colors (hues) are associated with mood-tones, 
94 subjects were presented with eight stimu- 
lus colors (red, orange, yellow, green, blue, 
purple, brown, and black) and a list of eleven 
moods (exciting-stimulating; secure-comfortable;
distressed-disturbed-upset; tender-soothing;
protective-defending; despondent-dejected-unhappy-melancholy;
calm-peaceful-serene; dignified-stately;
cheerful-jovial-joyful; defiant-contrary-hostile; and
powerful-strong-masterful), the word selections of
which had been unanimously agreed upon by 
four judges. No significant differences were 
found in color-mood association between male 
and female. It was found, however, that for 
each mood-tone certain colors were chosen to 
“go with” that mood-tone significantly more 
often than the remaining colors, and the re- 
sults were stated. 

Inasmuch as there was general agreement 
among studies concerning mood and color as- 
sociation, several possibilities for this were 
given, such being the influence of cultural, 
biological, and learning factors. 


Received December 3, 1953. 


Readability of Mathematical Tables *


Casual examination of several mathematical
or statistical tables will reveal great variation
in the typographical arrangements employed. 
From table to table the reader may find 
variation in type size, type face, use of addi- 
tional leading at periodic intervals, number of 
decimal places employed, etc. To a reader 
with some background in scientific typogra- 
phy, it is obvious that some of these factors 
should influence the readability of the tables. 
Since a particular mathematical table may be 
put to a number of different uses by workers 
or students in a variety of scientific fields, it 
would seem that arbitrary choice of a specific 
typographical arrangement for use in a cer- 
tain field is not the most important factor to 
consider, particularly with tables of squares, 
cubes, square roots, and cube roots which are 
widely used. In some tables, economy of 
space seems to be the sole consideration with 
no attention to readability factors. Where 
readability (or legibility) is mentioned, as by 
Milne (3), choice of typography depends 
upon opinion rather than upon experimental 
findings. 

Actually, there has been no experimental 
work done on the readability of mathematical 
tables. A few related findings in specialized 
kinds of reading may be cited: Baird (2) 
studied the legibility of a telephone directory.
He found that 4 point leading between lines
was 13 per cent more efficient in terms of
time taken to find a number than when set 
solid. He also found that indenting every 
other line in the directory increased only 
slightly (probably not significantly) the speed 
and accuracy of locating telephone numbers 
in comparison with an even alignment of 
names. Scott (5) had subjects read two 
pages of a railroad time-table, each set up in 
light-faced small type and heavy-faced large 
type. The large heavy-faced type was read 
faster and with considerably fewer errors. 

* The writer is grateful to the University of Minnesota
Graduate School for a research grant to finance
this study.


Size of type rather than heaviness of type 
face may have been the important factor. 
After inspecting a number of mathematical 
tables, Babbage (1) expressed a preference 
for numerals of uniform height (modern) 
rather than those with ascenders and descend- 
ers (Old Style). In a report (4) of the 
Committee on Type Faces it is recommended, 
on the basis of collected opinions, that mod- 
ernized Old Style numerals be used in mathe- 
matical tables. Reading the Old Style nu- 
merals is considered to produce less fatigue. 
Milne (3) also considers the Old Style nu- 
merical symbols, in which most of the char- 
acters have heads or tails, to be more legible 
than those of uniform height. Tinker (6) 
determined (a) the relative visibility of 
Modern and Old Style numerals by obtain- 
ing the average distance from the eyes at 
which the numerals could be read correctly; 
and (b) the speed and accuracy of reading 
the two kinds of numerals. The Old Style 
numerals, read in isolation, were slightly more 
visible (probability at the 2 per cent level), 
but in groups were much more visible (prob- 
ability beyond the one per cent level). But 
Modern numerals in groups were read just as 
fast and just as accurately under normal 
reading conditions as the Old Style numerals. 
It was suggested that, when numerals are
printed in groups as in tables, the Old
Style numerals be used because they are perceived
more easily (more visible).

The above citations merely suggest what 
might be more satisfactory in terms of size of 
type, leading, and type style. Since no ex- 
perimenting with actual tabular materials has 
been done, the need for some exploratory in- 
vestigation seems indicated. The purpose of 
the present study is to investigate the com- 
parative readability of five mathematical 
tables in terms of the speed with which sub- 
jects can find the squares, square roots and 
cube roots of numbers. Tables were chosen 
which permitted comparisons between type 
sizes, type faces, and arrangement of columns
and rows of numerals. 


Materials and Procedure 


The five published tables will be designated by 
the letters: A, B, C, D, and E. For purposes of 
comparison we will need a rather complete de- 
scription of each table. Only tables which in- 
cluded squares, cubes, square roots and cube 
roots were used from each book (except Table A 
which did not include cubes and cube roots). 
Table 1 shows the columnar arrangements of the 
five mathematical tables. 

Table A has a 6 X 9 inch page. The numerals 
are printed in an 8 point Modern (all numerals 
same height) type set solid with successive groups 
of five entries down the columns separated by 8 
point leading. The first column (No.) is in bold 
face and the remaining numerals ordinary light- 
face. Decimals are carried to three places. In 
the square column, there is ½ pica space between
each set of two numerals along a line. Columns 
are separated by a 1 pica space with no rule. 
The paper is a good quality mat white and thick 
enough so that shadows from print on the reverse 
side do not show through. 

Table B has a 5% X 84 inch page. The nu- 
merals are printed in an 8 point Old Style type 
(ascenders and descenders) set solid with succes- 
sive groups of five entries down the columns sepa- 
rated by 8 point leading. The first column (No.) 
is in bold face and the remaining numerals in 
ordinary face. Decimals are carried to seven 
places. In the square column, the numerals 
along a line are grouped in twos as in Table A, 
and by threes in the cube column. There is ½ to
1 pica space plus a rule between various columns. 
The paper is good quality mat white and thick 
enough so that no shadows show through. 

Table C has a 3{ X 63 inch page. The numer- 
als are printed in a 6 point Modern type, set 
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solid with groups of 10 entries down the columns 
separated by 6 point leading. There is no bold 
face type in this table. Decimals are carried to 
four places. There are no groupings into twos 
or threes along lines in the squares and cubes. 
There is a 1 pica space with no rule between 
columns. The paper is a good quality mat white 
and thick enough so that no shadows show 
through. 

Table D has a 33 X 63 inch page. The numer- 
als are printed in a 6 point Modern type, set 
solid with no grouping of entries down the columns,
but each fifth entry down a column is in
bold face which is only a little darker than the 
rest of the printing. Decimals are carried to 7 
places in the square and cube roots. There are 
no groupings along lines into twos or threes in the 
squares and cubes. There is a 4 pica space plus 
a rule between columns. The mat grayish white 
paper is so thin that shadows from print on the 
reverse side show through enough to hinder dis- 
crimination of the numerals. The printed page 
impresses the reader as being crowded and diffi- 
cult to read. 

Table E has a 44 X 64 inch page. The numerals
are printed in a 6 point Modern type, set
solid with groups of 10 entries down the columns 
separated by 6 point leading. Numerals in the 
No. column are in bold face. Square root deci- 
mals are carried to four places, cube root to five 
places. There are no groupings along lines into 
twos or threes in squares and cubes. There is a 
4 to 1 pica space between various columns plus 
a rule. The mat grayish paper is so thin that 
disturbing shadows show through from the print 
on the reverse side but these shadows are not as 
prominent as in Table D. 

The typographical arrangements of these five 
tables permit a number of interesting readability 
comparisons: type face, A vs. B; type size, A vs. 
C; leading between grouping of entries down columns, D vs. E; arrangement of columns, various; 
etc. All tables have 50 entries per column except D which has approximately 75. 

[Table: Arrangement of Columns in Five Mathematical Tables]

The experiment was carried out in a labora- 
tory room with uniform illumination of 20 foot- 
candles. The subjects were 120 university stu- 
dents. 

There were two general procedures: 1. The 
book was opened to page one of the table and on 
presentation of the number, the subject found 
the entry, turning pages where necessary, and 
read off the response. 2. The book was opened 
to the page of the table containing the number 
involved. Upon presentation of the number, the 
subject found the entry and read off the re- 
sponse. 

This was done separately for the finding of 
squares, square roots, and cube roots. There 
were, therefore, six parts to the experiment. 
Twenty different subjects served in each part. 
Materials (the five tables) and subjects were 
systematically rotated within each part to equate 
practice effects. 
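
The article does not describe the rotation scheme itself. One common way to equate practice effects is a cyclic (Latin-square) rotation, and the sketch below assumes that scheme. The function names and the rule for assigning subjects to orders are illustrative only, not taken from the study.

```python
from collections import Counter

TABLES = ["A", "B", "C", "D", "E"]

def rotation_orders(items):
    """Return every cyclic rotation of items (the rows of a Latin square)."""
    return [items[i:] + items[:i] for i in range(len(items))]

def assign_orders(n_subjects, items):
    """Assumed rule: subject k receives rotation number k mod len(items)."""
    orders = rotation_orders(items)
    return [orders[k % len(orders)] for k in range(n_subjects)]

# Twenty subjects in one part of the experiment, five tables each.
schedule = assign_orders(20, TABLES)

# Count how often each table falls in each serial position.
position_counts = Counter(
    (position, table)
    for order in schedule
    for position, table in enumerate(order)
)

for position in range(len(TABLES)):
    row = {table: position_counts[(position, table)] for table in TABLES}
    print(f"position {position + 1}: {row}")
```

Under this assumed scheme every table is used equally often first, second, and so on, so practice gained on earlier tables is spread evenly over all five.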

Ten numbers were looked up in each table by 
each subject. Upon arrival at the laboratory the 
subject was allowed to look over the five tables 
to become acquainted with them. He was then 
told that he would be presented with a number 
and that he was to look up and read off aloud 
the square (square root, cube root) as rapidly as 
possible. Two practice trials were given on each 
table just before it was used. The number to be 
looked up was presented typed on a 3 X 5 inch 
index card. The ten numbers to be looked up 
included one from each hundred up to a thousand 
(as 86, 141, 216, 324, 434, 538, 663, 728, 836 and 
982). A different series of numbers was used for 
each table. Times were recorded in seconds and 
tenths of a second from presentation of the num- 
ber (uncovered on table before the subject) to 
the beginning of the spoken response. All errors 
were tabulated. No information about results 
was given to a subject until the experiment was 
completed. 


Results 


The basic data of this study are given in 
Table 2. Comparison of the lower with the 
upper half of the table reveals that much 
more time is taken to find squares, square 
roots and cube roots of a number when the 
book is opened at the beginning of the mathe- 
matical table (subject finds page) than when 
the book is opened to the page containing the 
item (subject given page). When mean scores 
for one-half of the subjects were compared 
with those for the other half the consistency 
of trends from mathematical table to table 
was high in each part of the experiment. 

Table 2

Mean Time in Seconds Taken to Locate Squares, Square Roots and Cube Roots
in Five Mathematical Tables

(N = 20 college students in each comparison, 120 in all)

               Squares            Square Root          Cube Root
Table        Mean    S.D.        Mean    S.D.        Mean    S.D.

Subject Finds Page
A            5.18     .61        5.04     .49         --      --
B            5.24     .72        5.77     .86        5.20     .87
C            5.52  [illeg.]      5.70     .99        5.90     .86
D            6.34     .91        6.43    1.04        6.06    1.02
E            5.49     .71        6.30    1.08        6.06     .94

Page Given Subject
A            2.99     .40        2.90     .37         --      --
B            2.74     .52        2.90     .66        2.77     .57
C            2.74     .54        2.92     .62        2.96     .64
D            3.71     .61        3.50     .86        3.41     .69
E            3.02     .66        3.06     .69        2.90     .61

Note: Original computations were carried to four decimal places.
[illeg.] marks an entry illegible in this copy.



Correlation of the ranks obtained from these 
scores ranged from .80 to 1.00. Tabulation 
of the errors revealed a high percentage of 
accuracy. In only two per cent of the item 
responses were there errors. There was little 
difference in error count from table to table 
although they were somewhat fewer in mathe- 
matical Table A. We may, therefore, con- 
centrate our attention on the speed scores. 
In Table 3 are listed differences and criti- 
cal ratios for reading squares, square roots 
and cube roots of numbers when the subjects 
always started at page one of the mathemati- 
cal table. In finding squares of numbers, 
starting always at the beginning of the mathe- 
matical table, times were significantly faster 
in Tables A, B, C and E than in D. With 
less certainty, time for A was faster than for 
C. Note also that A was better than E, and 
that B was better than C and E although the 
differences were not significant. There was 
no important difference between A and B or 
C and E. (The whole pattern of differences 
will be coordinated below under discussion.) 

Table 3

Differences Between Means in Seconds with Critical Ratios for Finding Squares,
Square Roots and Cube Roots in Five Mathematical Tables When Subject Finds Page

Tables                 Squares
Compared           Diff.       C.R.

A vs. B            +.06        0.29
A vs. C            +.34        1.65
A vs. D          [illeg.]      4.74**
A vs. E          [illeg.]      1.48
B vs. C          [illeg.]      1.25
B vs. D          [illeg.]      4.24**
B vs. E          [illeg.]      1.10
C vs. D          [illeg.]      3.19**
C vs. E          [illeg.]      0.13
D vs. E          [illeg.]      3.29**

[illeg.] marks entries illegible in this copy; the Diff. and C.R. columns
for Square Roots and Cube Roots are illegible throughout.

* Significant at the 5 per cent level. 
** Significant at the 1 per cent level. 

The data on significance of differences for 
finding square roots, starting at the beginning 
of the mathematical tables, reveal that Table 
A was significantly superior to all other Tables (B, C, D, E). With a lesser degree of 
significance (5 per cent level), B was better 
than D, and C than D. Also, B was better 
than E and C than E although the significance of the differences did not reach the 5 
per cent level. 

In locating cube roots, Table B was su- 
perior to C, D, and E. No other differences 
were significant. 
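
The article does not state how its critical ratios were computed. The sketch below assumes the usual ratio of a difference between two means to its standard error, with the standard error taken as sqrt((s1^2 + s2^2) / N) for N = 20. The function name is illustrative, and the inputs are the rounded means and S.D.s for finding squares (subject finds page) from Table 2.

```python
import math

N = 20  # subjects in each part of the experiment

# (mean, S.D.) in seconds for finding squares, subject finds page (Table 2)
squares = {
    "A": (5.18, 0.61),
    "B": (5.24, 0.72),
    "D": (6.34, 0.91),
    "E": (5.49, 0.71),
}

def critical_ratio(first, second, n=N):
    """Absolute difference between two means over an assumed standard error."""
    (mean_1, sd_1), (mean_2, sd_2) = first, second
    standard_error = math.sqrt((sd_1 ** 2 + sd_2 ** 2) / n)
    return abs(mean_2 - mean_1) / standard_error

for a, b in [("A", "B"), ("A", "D"), ("B", "D"), ("D", "E")]:
    ratio = critical_ratio(squares[a], squares[b])
    print(f"{a} vs. {b}: C.R. = {ratio:.2f}")
```

Under this assumption the computed ratios fall within rounding of the legible Table 3 entries for squares (0.29, 4.74, 4.24 and 3.29 for the same four comparisons), which suggests, but does not prove, that a formula of this form was used.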

Turning now to the situations in which the mathematical tables were opened by the experimenter to the page containing the item to 
be located, the data on significance of differ- 
ences for finding squares, square roots and 
cube roots are given in Table 4. In finding 
squares, Tables A, B, C and E are much 
better than D. B and C are somewhat better 
than A although not significant at the 5 per 
cent level. There is no important difference 
between A and E, or B and C. 

Table 4

Differences Between Means in Seconds with Critical Ratios for Finding Squares,
Square Roots and Cube Roots in Five Mathematical Tables When Subject is Given Page

Tables                 Squares                 Square Roots
Compared           Diff.       C.R.         Diff.        C.R.

A vs. B            -.25        1.71          .00         0.00
A vs. C            -.25        1.66         +.02         0.12
A vs. D            +.72        4.43**       +.60       [illeg.]
A vs. E            +.03        0.17         +.16         0.92
B vs. C             .00        0.00         +.02         0.10
B vs. D            +.97        5.42**     [illeg.]     [illeg.]
B vs. E            +.28        1.49       [illeg.]       0.75
C vs. D            +.97        5.32**     [illeg.]     [illeg.]
C vs. E            +.28        1.47       [illeg.]       0.68
D vs. E            -.69        3.45**     [illeg.]     [illeg.]

[illeg.] marks entries illegible in this copy; the Diff. and C.R. columns
for Cube Roots are illegible throughout.

* Significant at the 5 per cent level. 
** Significant at the 1 per cent level. 

The data for significance of differences for 
square roots with proper page given to the 
subject reveal that Tables A, B and C were 
much better than D. There were no other 
significant differences. 

Data on significance of differences in look- 
ing up cube roots, page given, indicate that 
Tables B and E were considerably better than 
D. Other differences are unimportant. 


Discussion 


The results obtained when the subject al- 
ways started from the beginning page of the 
mathematical tables will be considered first. 
This is the kind of situation encountered by 
the reader in his customary use of tables of 
this kind. In practically every instance, a 
significantly greater time was required to lo- 
cate squares, square roots and cube roots in 
Table D than in the other tables. In Table 
D, the type was smaller than in A or B; in 
D there was no additional leading to separate 
groups of items down columns to aid in following across rows in contrast to A, B, C and 
E; also, the paper was thinner in D than in 
A, B and C. It is possible that the arrange- 
ment of columns in D retarded finding the 
proper entry since it was necessary to skip 
over the first two columns next to the No. 
column. However, this may not be impor- 
tant since an analogous situation occurs for 
some entries in B and E. The fact that E is 
better than D for finding squares is probably 
due to the additional leading which separates 
groups of items down columns. 

Keeping in mind that in A and B the spacing and type size were the same (8 point), 
and that the squares column was next to the 
No. column in both, the lack of difference in 
finding squares suggests that style of type 
face is unimportant (Modern vs. Old Style). 
The significant difference in favor of A over 
B in finding square roots is undoubtedly due 
to the arrangement of columns. In A, the 
square roots were next to the squares (third 
column), but in B they were in the sixth 
column. 

In Tables C and E, the type size, spacing 
down rows, arrangement of columns, and 
spacing between columns were alike or very 
similar. This probably explains the lack of 
difference in finding squares and in finding 
cube roots. 

Lack of difference between D and E in finding square roots and cube roots seems to be 
due to the fact that the typography is similar 
in both: thin paper, intercolumnar spacing 
and rules, 6 point type, and arrangement of 
columns. Reason for the slight superiority 
of E over D in finding squares is not clear. 
Perhaps column arrangement was a factor, for 
the squares were in the third column in D. 

The factors which favor readability of 
mathematical tables, when the reader starts 
at the beginning of each table may be 
summed up as follows: Improved readability 
is achieved by larger type (8 vs. 6 point), by 
using at least 1 pica space between columns 
with no rules (rules probably do not lessen 
readability provided there is adequate space 
between columns, e.g., Table B), by separat- 
ing items down a column into groups of five 
or ten by leading equivalent to the type size 
used, by a favorable arrangement of columns 
across page (as No., square, square root, 
cube, cube root when one is interested mainly 
in squares, cubes and roots), and by use of 
mat white paper thick enough to prevent 
shadows showing through from print on the 
reverse side. 

When the reader began at page one in find- 
ing squares and square roots, as discussed 
above, the order of the mathematical tables 
from most to least readable was found to be 
A, B, C, E, D. Table A was by far the best 
(errors as well as time) and D was by far 
the least readable. 

When we consider readability uncompli- 
cated by turning pages (given page), the pic- 
ture is similar but not the same. Tables A, 
B, C and E are much more readable than D. 
The main differences in typography common 
to A, B, C and E in contrast to D are the 
grouping of items down columns by inserting 
leading after every fifth or tenth entry, and 
more adequate spacing between columns. In 
addition it should be noted that A and B have 
8 point type and are printed on thick paper 
in contrast to D which is in 6 point type on 
thin paper. The slight superiority of B and 
C over A is difficult to understand. Table B 
is in 8, and C in 6 point type while A is in 
8 point. The only typographical difference 
common to B and C in contrast to A is that 
in B and C there is only one set of columns 
present (50 No. items) per page while in A 
there are three sets of columns of 50 items 
each in the No., square, and square root col- 
umns, i.e., 150 successive items per page. It 
is possible that the need to locate the proper 
No. column as well as the numeral hindered 
the reader somewhat in Table A. It is un- 
likely that the Old Style numerals in B vs. a 
modern style in A was important although 
this should be noted. 

The lack of any difference between A and 
E must have a similar explanation. The ad- 
vantage of 8 point type in A vs. 6 point in E 
seems to be nullified by the need to identify 
the correct No. column as well as the desired 
numeral in A. The reason for lack of any 
difference between B (8 point) and C (6 
point) is not clear. Two typographical differences should be noted: 1. In C the columns 
are separated by a 1 pica space while in B 
they are separated by a 1 pica space plus a 
rule. 2. In C the column entries are grouped 
in tens while in B they are grouped by fives. 
Grouping by tens may have an advantage. 
It is possible that the intercolumnar spacing 
and the grouping by tens in C offset the advantage of the larger type in B. 

The slight superiority of B over E must be 
in the quality of the paper (thick vs. thin) 
and size of type (8 vs. 6). And the slight 
superiority of C over E may be due to quality 
of paper plus perhaps the space rather than 
rules between columns. 

There are fewer significant differences be- 
tween the mathematical tables for finding 
square roots with page given. A, B, and C 
are better than D. The essential typographi- 
cal differences between D and the other 
tables are lack of grouping down columns, 
and less adequate spacing between columns 
plus perhaps quality of paper. 

The pattern of differences when looking up 
cube roots is similar to that for square roots. 
B and E are better than D. Other differ- 
ences are not important. Apparently the 
same typographical factors are operating as 
in looking up square roots. 

The above discussed factors which favor 
readability of mathematical tables when the 
reader is given the page on which the number appears (no turning of pages) may be 
summarized as follows: grouping of items 
down columns by inserting ample leading 
(grouping by tens may be better than by 
fives), use of one set of columns per page, 
use of at least a 1 pica space between columns 
without rules, use of white mat paper thick 
enough so that shadows from print on the re- 
verse side do not show through, plus perhaps 
size of type. Variation in style of type face 
seems unimportant. Apparently the sugges- 
tions by Milne (3), Tinker (6) and others 
(1, 4) that Old Style numerals should be 
more legible than a modern face when used in 
tables do not hold. 

When the page on which a number was to 
appear was given the reader in finding squares 
and square roots, the order of the mathemati- 
cal tables from most to least readable was 
(roughly) B, C, A, E, D. The difference be- 
tween B and C was small. 

This experiment was designed as a pre- 
liminary investigation of the readability of 
mathematical tables. Obviously, the experi- 
mental design is imperfect. There are too 
many variables involved. One variable at a 
time should be studied, or a design should be 
employed that permits isolation of the vari- 
ance due to each variable. Nevertheless, the 
present results suggest that the more im- 
portant typographical factors favoring good 
readability are use of at least 8 point type, a 
favorable arrangement of columns, at least 1 
pica space between columns without rules, 
ample leading to separate entries down col- 
umns into groups of five or ten, and paper 
thick enough to prevent shadows showing 
through from print on the reverse side of page. 


Summary 


1. The purpose of this experiment is to in- 
vestigate the influence of certain typographi- 
cal variations upon the readability of mathe- 
matical tables. 

2. Times in seconds to look up squares, 
square roots, and cube roots were obtained: 
(a) when subjects always started with page 
one of the tables; and (b) when the page 
containing the number sought was given. 

3. Twenty adult subjects served for each 
of the six parts to the experiment, 120 in all. 

4. Five mathematical tables representing a 
wide range of typographical variations were 
used. 

5. The results of this experiment revealed 
certain typographical factors which promote 
more effective readability as well as certain 
conditions that should be avoided. The evi- 
dence educed here suggests the following as 
an effective typographical arrangement for 
mathematical tables: 


a. Do not crowd an excessive number of 
columns into a table. This is apt to occur in 
general purpose tables which include such 
things as reciprocals, areas, etc. in addition 
to squares, cubes and roots. 

b. Use only one set of columns per page 
with about 50 entries per column. 

c. Use at least 8 point type, either Old 
Style or a modern face. 

d. Employ generous leading to separate 
numerals into groups of five down columns, 
and then show grouping into tens by an un- 
derline below each tenth row or by some 
other technique which will be easily noted. 

e. Use at least 1 pica space between columns without rules. 

f. Use bold face printing in the No. column. 

g. Use paper thick enough so that shadows 
from print on the reverse side of the page do 
not show through. 

h. Use mat white paper and jet black ink 
to assure maximum contrast between ink and 
paper. 

Received December 8, 1953. 

Research designed to ascertain relative preferences for variations of a food product is 
essentially research on the judgmental proc- 
ess. Because of this the investigator must 
draw upon psychophysics in deriving his 
methodology (4). The psychophysical methods available fall into two general categories 
—the methods of comparative judgment and 
the method of single stimulus (sometimes re- 
ferred to as the method of absolute judg- 
ments). All methods of comparative judg- 
ment are alike in that they require the Ss to 
make direct comparisons of the items in one 
session. In the method of single stimulus the 
Ss judge an item without a specific compari- 
son stimulus being present. 

In food preference research using the 
method of comparative judgment two varia- 
tions are frequently employed—paired com- 
parisons and rank order. When the number 
of items being investigated is three, the 
method of paired comparisons has proved 
efficient (1). With four items the number of 
pairings is too great for a taste test confined 
to one session per S; in this instance the 
method of rank order has been used (2, 3). 
It can be argued, however, that any compara- 
tive judgment procedure is not realistic from 
the standpoint of actual consumer behavior. 
How often does the consumer make compara- 
tive judgments of the type involved in paired 
comparisons or rank order experimental de- 
signs? The consumer’s situation, in which he 
uses the product without a comparison item 
present, is a duplication of the method of 
single stimulus. This being so, a critical 
question becomes—what is the nature of the 
ordering of food items, with respect to pref- 
erence, by the two general procedures? The 
present research is directed toward this ques- 
tion. 


1 The authors are indebted to Dr. Forrest E. 
Clements, of the Bureau of Agricultural Economics, 
for his assistance in this research. The canned orange 
juices were provided by the Florida Experiment Sta- 
tion and the Bureau of Agricultural Economics. 

The food items used were four canned 
orange juices that varied in °Brix and in 
Brix-acid ratio. °Brix is a measure of the 
specific gravity of sugar solutions; the Brix- 
acid ratio is a measure of the relation be- 
tween the °Brix and the amount of acid in 
the solution. Changes in °Brix or Brix-acid 
ratio are correlated with changes in the tart- 
sweet quality of the orange juice; the higher 
the °Brix or Brix-acid ratio the sweeter the 
juices taste. 


Experimental Design 


Subjects and Materials. The Ss were 120 in- 
dividuals 17 years of age and over (68 men; 52 
women). The Ss were randomly assigned to the 
various experimental groups. Each S was tested 
individually. 

The four canned orange juices used were: I. 
13.2 °Brix; 18.3 Brix-acid ratio; II. 13.4 °Brix; 
9.9 Brix-acid ratio; III. 9.0 °Brix; 18.3 Brix-acid 
ratio; and IV. 9.3 °Brix; 9.9 Brix-acid ratio. 

Variables such as variety of orange and peel- 
oil content were constant. 

The juices were kept under refrigeration so 
that they were always chilled when used in a test. 
The juices were served in non-waxed paper cups. 


Procedure. One-half of the Ss first judged 
the four juices using the method of rank order 
and, after a drink of water and a rest period, 
rated one of the juices under single stimulus 
conditions. The other half of the Ss reversed 
this procedure. The particular juice judged 
under single stimulus conditions was assigned 
to the Ss in a random manner—one-fourth 
of the Ss judging a given juice in this par- 
ticular manner. 

The procedure for the method of rank order 
was as follows. The four juices were placed 
in a row in front of the S. The original po- 
sition of the juices varied randomly from S 
to S. The S tasted the juice on the extreme 
left and placed it in front of the other three. 
The juice now on the left in the original row 
was tasted and placed to the left or right of 
the first one in terms of, “I like this one 
better” (placed to right) or, “I like that one 
better” (placed to left). Each of the remaining juices was tasted and placed in the new 
row being developed. When the new order 
had been established the S took a sip of 
water and tasted the sequence again to verify 
his order of preference. He was permitted to 
change the order if he wished. The scoring 
was 1 to 4, starting with the most preferred 
juice, the one occupying the extreme right 
position. 

Table 1

Rank Order Preferences for Four Canned Orange Juices

           13.2 °Brix    13.4 °Brix    9.0 °Brix     9.3 °Brix
Rank       18.3 Brix-    9.9 Brix-     18.3 Brix-    9.9 Brix-
order      acid ratio    acid ratio    acid ratio    acid ratio

1              83            25             7             5
2              18            52            34            16
3              13            24            39            44
4               6            19            40            55
Mn           1.52          2.31          2.93          3.24
N             120           120           120           120

χ² = 228.36; d.f. = 9; P < .01.

A rating scale was used for the method of 
single stimulus judgments. The scale was 
called a “Taste Thermometer.” The values 
ranged from 0 to 100 with gradations of five 
indicated and the tens numbered. Opposite 
100 was the statement, “The best I have ever 
tasted”; opposite zero was, “The worst I have 
ever tasted.” At 50 was, “Fair; average.” 
The space between 50 and 100 contained the 
statement, “Better than average” and between 
0 and 50 the statement, “Poorer than average.” The S was told that he would be given 
one juice to drink and that he would then 
give it a score. The Ss were not told that the 
juice was one of the four in the rank order 
procedure. 


Results 


The results with the rank order procedure 
are presented in Table 1. The mean rank for 
each juice is given but chi-square was used as 
the test of significance. The order of pref- 
erence was 13.2 °Brix; 18.3 Brix-acid ratio, 
13.4 °Brix; 9.9 Brix-acid ratio, 9.0 °Brix; 
18.3 Brix-acid ratio, and 9.3 °Brix; 9.9 Brix- 
acid ratio. Chi-square tests of all pairings 
revealed that each juice was significantly dif- 
ferent from the other juices. 
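
The mean ranks and the overall chi-square can be rechecked from the Table 1 frequencies. The sketch below assumes the reported statistic is Pearson's chi-square on the 4 x 4 table of rank by juice; the article does not name the computing formula, and the function names are illustrative.

```python
# Rank-order frequencies from Table 1: rows are ranks 1 to 4,
# columns are the four juices in the order the table lists them.
observed = [
    [83, 25, 7, 5],
    [18, 52, 34, 16],
    [13, 24, 39, 44],
    [6, 19, 40, 55],
]

def mean_ranks(table):
    """Mean rank for each column, with rows scored 1, 2, 3, ..."""
    means = []
    for j in range(len(table[0])):
        total = sum(row[j] for row in table)
        weighted = sum((rank + 1) * row[j] for rank, row in enumerate(table))
        means.append(weighted / total)
    return means

def chi_square(table):
    """Pearson chi-square and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - expected) ** 2 / expected
    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, degrees_of_freedom

print([round(m, 2) for m in mean_ranks(observed)])
statistic, degrees_of_freedom = chi_square(observed)
print(round(statistic, 2), degrees_of_freedom)
```

The mean ranks reproduce the Mn row of Table 1. The chi-square comes out near 228.4 with 9 degrees of freedom, close to the reported 228.36; the small gap may reflect rounding or a slightly different computing formula.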

Table 2 gives the data for the rank order 
procedure as a function of time of presenta- 
tion (before or after the single stimulus pro- 
cedure). In no instance were the rank order 
distributions for a juice significantly different 
between sessions. 

The analysis of variance results for the 
single stimulus data are shown in Table 3. 


Table 2

Rank Order Preferences for Four Canned Orange Juices by Time of Presentation

        13.2 °Brix          13.4 °Brix          9.0 °Brix           9.3 °Brix
        18.3 Brix-acid      9.9 Brix-acid       18.3 Brix-acid      9.9 Brix-acid
        ratio               ratio               ratio               ratio

        Before    After     Before    After     Before    After     Before    After
Rank    single    single    single    single    single    single    single    single
Order   stimulus  stimulus  stimulus  stimulus  stimulus  stimulus  stimulus  stimulus

1         38        45        17         8         2         5         3         2
2         11         7        24        28        16        18         9         7
3          8         5        10        14        19        20        23        21
4          3         3         9        10        23        17        25        30
Mn       1.60      1.43      2.18      2.30      3.05      2.82      3.17      3.32
N         60        60        60        60        60        60        60        60

χ²          2.17                4.65                2.32                0.995
d.f.        3                   3                   3                   3
P           > .05               > .05               > .05               > .05

Table 3

Analysis of Variance for Single Stimulus Ratings of Four Canned Orange Juices

                                    Sum of         Mean
Source of Variation                 Squares        Square        F

Juices                              4,895.00      1,631.67      4.669 (P < .01)
Time of presentation                2,803.34      2,803.34      8.022 (P < .01)
Juices X Time of presentation       1,028.34        342.78
Within                             39,136.99        349.44

Total                              47,863.67

F for variance between juices was significant. 
Inspection of the mean ratings, however, re- 
vealed the following: 13.2 °Brix; 18.3 Brix- 
acid ratio and 13.4 °Brix; 9.9 Brix-acid ratio 
had means of 58.00 and 57.83, respectively. 
The means for 9.0 °Brix; 18.3 Brix-acid ratio 
and 9.3 °Brix; 9.9 Brix-acid ratio were 46.00 
and 43.50, respectively. The differences were 
significant only when °Brix varied. In other 
words, the two high °Brix juices were significantly different from the two low °Brix juices. 
Variation in Brix-acid ratio did not yield sig- 
nificant differences in preference. 

The variance between presentations was 
significant. The mean rating for all juices, 
when the single stimulus procedure came first, 
was 46.50. The mean rating for all juices, 
when this procedure followed the rank order 
method, was 56.17. 
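
The F ratios in Table 3 can be rechecked from the printed sums of squares and mean squares. In the sketch below the degrees of freedom are an inference (sum of squares divided by mean square), since no d.f. column is legible in this copy, and the within-groups mean square is assumed to be the error term for each F. The function names are illustrative.

```python
# (sum of squares, mean square) as printed in Table 3
table_3 = {
    "Juices": (4895.00, 1631.67),
    "Time of presentation": (2803.34, 2803.34),
    "Juices X Time of presentation": (1028.34, 342.78),
    "Within": (39136.99, 349.44),
}

def implied_df(sum_of_squares, mean_square):
    """Degrees of freedom implied by SS / MS, rounded to a whole number."""
    return round(sum_of_squares / mean_square)

def f_ratio(effect, error="Within", table=table_3):
    """Effect mean square over the (assumed) within-groups error term."""
    return table[effect][1] / table[error][1]

for source, (ss, ms) in table_3.items():
    line = f"{source}: implied d.f. = {implied_df(ss, ms)}"
    if source != "Within":
        line += f", F = {f_ratio(source):.3f}"
    print(line)

total_df = sum(implied_df(ss, ms) for ss, ms in table_3.values())
print("total implied d.f. =", total_df)
```

The implied degrees of freedom sum to 119, which is what 120 Ss would give, and the two reported F values are recovered to three decimals. The interaction ratio, which the table does not report, comes out just under 1 on these figures.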


Discussion 


The above results show that the preference 
pattern for four canned orange juices is a 
function of the experimental design used. 
When the Ss followed the rank order design 
both °Brix and Brix-acid ratio contributed 
to preference differentiation. With each S 
making only a single stimulus rating of one 
juice (the four juices being randomly as- 
signed to four such groups) preference differentiation occurred only in terms of °Brix. It 
will be noted that in both methods preference 
was associated with the relatively higher 
°Brix (the sweeter juices). 

A frame of reference factor was found in 
the analysis of variance of the data obtained 
by the method of single stimulus. Starting 
“cold” with this procedure produced rather 
low ratings for all juices. When this method 
followed the rank order procedure the mean 
ratings of the juices were appreciably higher. 

One limitation that must be placed upon 
these results rests in the fact that each S did 
not have experience with the four juices un- 
der single stimulus conditions. Another ex- 
periment, just completed, indicates that when 
such is the case no significant differences in 
mean ratings are obtained for juices that vary 
in Brix-acid ratio with °Brix held constant. 
This agrees with the finding in the present experiment. 


Summary 


1. Preferences for four canned orange juices 
that varied in °Brix and in Brix-acid ratio 
were obtained by a method of comparative 
judgment procedure (rank order) and the 
method of single stimulus (using a rating 
scale). 

2. The rank order procedure produced pref- 
erence differences in terms of °Brix and of 
Brix-acid ratio. 

3. The single stimulus procedure produced 
differences only in terms of °Brix. 

4. From both methods it appears that pref- 
erence is associated with juices of relatively 
higher °Brix (the sweeter juices). 


Received November 20, 1953. 


Method of Single Stimulus Determinations of Taste Preference 1 


Forrest E. Clements, James A. Bayton, and Hugh P. Bell 


United States Department of Agriculture, Washington, D. C. 


There are two fundamental considerations 
that would lead one to select a method of 
single stimulus approach in research on pref- 
erences for variations of a food product. 
First, the method of single stimulus is a 
duplication of the situation that is typical for 
consumers. Seldom does the consumer have 
available in his home at a given time several 
variations of a particular food product—the 
situation that would be conducive to making 
preference judgments based upon the direct 
comparisons involved in the method of com- 
parative judgments. Realistic research on 
taste preference should attempt to utilize the 
actual home situation. 

The second factor that will force the ex- 
perimental design into a method of single 
stimulus model is the number of the varia- 
tions of the food product being tested since 
adaptation is a variable. It has been our ex- 
perience, in either laboratory or home situa- 
tions, that three items is the maximum num- 
ber for a paired comparison design when test- 
ing occurs in one session; with a rank order 
design four items is the maximum (1, 2, 3, 4, 
5). When the number of items is five or 
more comparative judgment models should be 
abandoned for a method of single stimulus 
design. 

Of necessity, any method of single stimulus 
design for determining taste preference will 
require the use of some type of rating scale 
or scoring system. Most scales used in such 
research are either point-scales with terse defi- 
nitions of each point or “thermometers” al- 
lowing for 0 to 100 scoring with descriptions 
such as “Excellent,” “Good,” etc., at selected 
points. When designing a research project 
that will involve a large-scale sample of a 
cross-section of consumers the scale or scoring system used will have to be easily understood. Because of this, a third type of scale 
was investigated in the present experiment. 
This was a highly unstructured scale with 
only the extremes of the continuum defined. 
None of the points available for choice as expressing degree of preference was identified 
or defined. The primary purpose of this experiment was to test the relative efficiency of 
the three types of scales in determining taste 
preferences when the research is conducted 
with a method of single stimulus design under realistic conditions. 

1 The canned orange juices used in this experiment were furnished by the Florida Experiment 
Station. The authors wish to acknowledge the efforts of Mrs. Motier Fisher and Mrs. Mary George 
Robinson who did the necessary field work. Discussions with Dr. Franklin R. Kilpatrick led to our development of Scale B. 

The particular food items used were three 
canned orange juices that varied in Brix-acid 
ratio with °Brix held constant. °Brix is a 
measure of the specific gravity of sugar solu- 
tions; Brix-acid ratio is a measure of the re- 
lation between the °Brix and the amount of 
acid in the solution. The lower Brix-acid 
ratios are characterized by tart-sour taste; 
the higher Brix-acid ratios are sweeter in 
taste. In addition, when °Brix is constant 
and amount of acid varies the juices change 
in body or consistency; the higher Brix-acid 
ratios tend to be “thinner” as well as sweeter. 

This research was preliminary to a large- 
scale investigation involving six canned orange 
juices. To facilitate this preliminary project 
the lowest and highest Brix-acid ratios in the 
six juices and one from the middle set were 
used. 


Procedure 


Scales. Scale A ranged from 0 to 100 with 
gradations of five numbered. To the right of the 
scale certain areas were bracketed and labeled. 
The area 90-100 was “Excellent”; 70-90 was 
“Very Good”; 50-70 was “Good”; 30-50 was 
“Fair”; 10-30 was “Poor”; 0-10 was “Very 
Poor.” The Ss were instructed first to decide 
what they thought of the juice in a general way 
—“Very Good,” “Poor,” etc.,—and then to rate 
it by assigning a score in the particular area. 
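
The relation between a Scale A score and its bracketed label can be sketched as a simple lookup. This is only an illustration: the function name is hypothetical, and because the brackets as printed share their end points (90 closes "Very Good" and opens "Excellent"), the sketch assumes that a boundary score falls in the higher bracket, which the text does not state.

```python
# Lower bounds of the bracketed areas on Scale A, highest first.
SCALE_A_BRACKETS = [
    (90, "Excellent"),
    (70, "Very Good"),
    (50, "Good"),
    (30, "Fair"),
    (10, "Poor"),
    (0, "Very Poor"),
]


def scale_a_label(score):
    """Return the bracket label for a 0-100 Scale A score.

    Assumption (not stated in the text): a score on a shared boundary,
    such as 90, is placed in the higher bracket.
    """
    if not 0 <= score <= 100:
        raise ValueError("Scale A scores run from 0 to 100")
    for lower_bound, label in SCALE_A_BRACKETS:
        if score >= lower_bound:
            return label
```

Under this assumption a score of 61 is read as "Good" and a score of 25 as "Poor."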

Scale B consisted of ten 5/16” squares ar- 
ranged vertically. Above the top square was 
“Excellent”; below the bottom one was “Very 
Poor.” No other definitions or descriptive statements
were given. The Ss were shown that their
opinion about the juice could be expressed as
falling anywhere from “Very Poor” up through
“Excellent.” They were to put a cross in the
square that expressed their opinion. Scoring for
Scale B was 1 to 10, starting with the bottom
square.

Scale C was the following 7-point scale:


Excellent—the best canned orange juice I have 
ever tasted. 

Good—much better than other canned orange 
juice I have tasted; but not the best. 

Fair—a little better than other canned orange 
juice I have tasted; but not much better. 

Borderline—can’t decide whether it is better 
or worse than other canned orange juice I 
have tasted. 

Poor—a little worse than other canned orange 
juice I have tasted; but not much worse.

Very Poor—much worse than other canned
orange juice I have tasted; but not the worst.

Objectionable—the worst canned orange juice
I have ever tasted.


The Ss were instructed to check the square 
preceding the statement that expressed their opin- 
ion about the juice. The scoring was 1 to 7, 
starting with “Objectionable.” 

Descriptive Check-List. After rating a juice 
the Ss were asked to check those items on a list 
that they thought described it. They could check 
as many items as they thought applied. The 
check-list contained items such as, “Too sweet,” 
“Too tart or sour,” “Just the right sweetness,” 
etc. 

Experimental Designs. The Ss were adult 
members of households in a new residential area 
adjacent to Alexandria, Virginia. The area con- 
sisted of about 600 homes. Approximately every 
seventh home was contacted, yielding a panel of 
90 households. To be eligible a household had 
to contain at least two adults who agreed to 
participate throughout the experiment. 

Test I. The purpose of Test I was to obtain 
preference ratings for the three canned orange 
juices on each of the three scales. The 90 house- 
holds were divided into three sets of 30 each. 
Each set received a given scale. On the first 
placement of the juices 10 households using a 
given scale received 12 Brix-acid ratio, 10 re- 
ceived 16 Brix-acid ratio, and 10 received 22 
Brix-acid ratio. These assignments were made 
in a random manner. Three placements were 
made per household until each S had experience 
with all of the juices. Three to four days in- 
tervened between placements. The homemakers 
were instructed to place the juice in the re- 
frigerator overnight and to make the tests the 
following day. 

The juices were in unlabeled 10-ounce cans, 
coded for identification purposes. Only one can 
of juice was left at a household per placement. 
This was done so that all the juice would be consumed
when tested. As stated, at least two Ss
per household made the tests. 
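
The first-placement design of Test I can be sketched as a random allocation. The sketch below is an illustration under stated assumptions: the function name, the household numbering, and the use of a seeded pseudo-random shuffle are not from the study, and the order of the second and third placements is left out because the text does not describe it.

```python
import random

SCALES = ["A", "B", "C"]
RATIOS = [12, 16, 22]  # Brix-acid ratios of the three juices


def assign_test_one(households, seed=0):
    """Randomly assign 90 households to scales and first-placement juices.

    Each scale gets 30 households; within a scale, 10 households start
    with each juice. Only the first placement is assigned here.
    """
    shuffled = list(households)
    if len(shuffled) != 90:
        raise ValueError("the panel consisted of 90 households")
    rng = random.Random(seed)
    rng.shuffle(shuffled)
    plan = {}
    for s, scale in enumerate(SCALES):
        block = shuffled[s * 30:(s + 1) * 30]
        for r, ratio in enumerate(RATIOS):
            for household in block[r * 10:(r + 1) * 10]:
                plan[household] = (scale, ratio)
    return plan


# Households are numbered 1 to 90 purely for illustration.
plan = assign_test_one(range(1, 91))
```

Each of the nine scale-by-juice cells then holds ten households, as in the design described above.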

Test II. The purpose of Test II was to in- 
vestigate the reproducibility of the ratings for 
the 12 Brix-acid ratio juice on each scale. Fifteen
of the 30 households that worked with a
given scale in Test I were selected. The Ss were
told merely that they were rating “another” juice 
in our set. This test was conducted about one 
month after the completion of Test I. 

Test III. The purpose of Test III was to in- 
vestigate the reproducibility of the preference re- 
lationship obtained with Scale B in Test I for 
the 12 and 22 Brix-acid ratio juices. This test 
took place approximately two months after Test 
I. The 30 households that had worked with 
Scale B in Test I took part in this test. One-half
of the households received the 12 Brix-acid
ratio juice on the first placement and the remainder
received the 22 Brix-acid ratio juice; after three
days, the juices were reversed.


Results 


The mean preference ratings on Scale A for 
the 12, 16, and 22 Brix-acid ratio juices were 
61.0, 60.1, and 57.4, respectively. The dif- 
ferences were not significant. Scale B yielded 
mean ratings for the respective juices of 5.7, 
5.8, and 5.8. On Scale C the means were 5.3, 
5.1, and 5.0. Neither of the latter two scales 
produced significant differences between the 
juices. 

Table 1 presents the preference data in
terms of whether an S scored the 12 Brix-acid
ratio juice above or below the mean for that
juice on a given scale. Those who scored the
juice above the mean were designated as the
“Like” group; those scoring it below the mean
were called the “Dislike” group. On each
scale, the “Like” 12 Brix-acid ratio group
gave a significantly higher rating to that juice
than to either the 16 or the 22 Brix-acid ratio
juices. In these groups, however, the ratings
for the latter two juices were not significantly
different. Conversely, on each scale, the
“Dislike” 12 Brix-acid ratio group tended to
give higher ratings to the 16 and 22 Brix-acid
ratio juices than to the 12 Brix-acid
ratio juice. These differences were not significant
for Scale A. In Scale C, the difference
between 12 and 16 Brix-acid ratio was
significant; the difference between 12 and 22
Brix-acid ratio was not significant. Both of
the differences involved were significant on
Scale B. For the “Dislike” 12 Brix-acid
ratio group, none of the 16-22 Brix-acid ratio
differences were significant.


Table 1

Preference Scores for Canned Orange Juices that Vary in Brix-acid Ratio by
“Liking” vs. “Disliking” Brix-acid Ratio 12 (Test I)

                     Brix-acid ratio (12 °Brix)                  t for mean difference
                   12           16           22    12 vs. 16    12 vs. 22    16 vs. 22

“Like” Brix-acid Ratio 12

Scale A (N = 30)
  Mn             76.7         66.8         61.3        3.2**        4.1**        1.3
  SD              7.0         16.0         17.7
Scale B (N = 32)
  Mn      [illegible]  [illegible]          5.7        4.0**  [illegible]        0.3
  SD              1.1          2.3          2.6
Scale C (N = 35)
  Mn              6.3          5.2  [illegible]        5.0**        4.4**        0.2
  SD              0.5          1.2          1.5

“Dislike” Brix-acid Ratio 12

Scale A (N = 29)
  Mn             44.8         53.2         53.3        2.0          1.8          0.1
  SD             11.4         18.5         19.3
Scale B (N = 25)
  Mn              3.4          6.2          6.1        5.4**        4.1**        0.4
  SD              1.4          1.9          2.4
Scale C (N = 25)
  Mn              4.0          5.0          4.6  [illegible]        1.3          0.8
  SD              1.2          1.2          1.8

** Significant at the 1 per cent level.
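
The “Like”-“Dislike” analysis can be sketched in a few lines. The sketch is illustrative only: the ratings are hypothetical, the names `split_by_mean` and `paired_t` are not from the study, and a paired t is assumed because the same Ss rated all three juices (the text does not name the formula used). Ss scoring exactly at the mean are placed in the “Dislike” group here, which is also an assumption.

```python
import math
import statistics


def paired_t(first, second):
    """t for the mean difference between two sets of paired ratings."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))


def split_by_mean(ratings, reference):
    """Split Ss into "Like" and "Dislike" groups on the reference juice.

    ratings maps each juice to a list of scores, one per S, in the
    same S order. An S above the mean for the reference juice goes to
    "Like"; all others go to "Dislike".
    """
    cutoff = statistics.mean(ratings[reference])
    groups = {
        "Like": {juice: [] for juice in ratings},
        "Dislike": {juice: [] for juice in ratings},
    }
    for i, score in enumerate(ratings[reference]):
        name = "Like" if score > cutoff else "Dislike"
        for juice in ratings:
            groups[name][juice].append(ratings[juice][i])
    return groups


# Hypothetical Scale B ratings (1 to 10) from eight Ss, for illustration only.
ratings = {
    12: [9, 8, 8, 7, 3, 2, 4, 3],
    16: [6, 5, 7, 4, 6, 7, 5, 6],
    22: [5, 6, 4, 5, 7, 6, 6, 5],
}
groups = split_by_mean(ratings, reference=12)
t_like = paired_t(groups["Like"][12], groups["Like"][16])
t_dislike = paired_t(groups["Dislike"][12], groups["Dislike"][16])
```

With these made-up ratings the “Like” group yields a positive t (the 12 juice rated above the 16 juice) and the “Dislike” group a negative one, which is the pattern the table reports.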

A similar analysis was made with the 16 
Brix-acid ratio juice. The general pattern 
was that when this particular juice was 
“liked” its mean was higher than those for 
either the 12 or 22 Brix-acid ratio juices. 
The respective differences involved were sig- 
nificant on each scale. When the 16 Brix- 
acid ratio juice was “disliked” its mean was 
lower than those for the 12 or 22 Brix-acid 
ratios. Both of the differences involved were
significant only for Scales B and C. 

Table 2 repeats the above analysis in 
terms of “Like” and “Dislike” 22 Brix-acid 
ratio for each scale. The pattern in this in- 
stance was for those who “liked” the 22 Brix- 
acid ratio juice to give the other two juices 
lower scores. Both of the particular differ- 
ences involved were significant on Scales A 
and B. Those who “disliked” the 22 Brix-acid
ratio juice gave the other two juices higher
scores than observed for the 22 Brix-acid ratio.
Both differences were significant on each scale.

The data on reproducibility of the preference
ratings for the 12 Brix-acid ratio juice,
after a month had passed, showed that for
each scale the difference in preference rating
was not significant.

The results on reproducibility of the preference
data on Scale B for the 12 and 22
Brix-acid ratio juices two months after Test
I demonstrated that the mean preference ratings
for the two juices again were not significantly
different. However, the division of the
Ss into “liking” and “disliking” a respective
juice revealed a pattern similar to that obtained
in the original analysis. Those who
“liked” a given juice tended to give the other
one lower ratings; when a juice was “disliked”
the other juice was given higher ratings.
The difference was not significant, however,
for those who “disliked” the 12 Brix-acid
ratio juice.
Table 2


Preference Scores for Canned Orange Juices that Vary in Brix-acid Ratio by
“Liking” vs. “Disliking” Brix-acid Ratio 22 (Test I)


Brix-acid ratio (12 °Brix)          t for mean difference


12 16 22          12 vs. 16   12 vs. 22   16 vs. 22


“Like” Brix-acid Ratio 22


Scale A (N =
Mn 64.3
SD 18.8
Scale B (N
Mn 6.4
SD 2.1
Scale C (N =
Mn ~ A 4.9
SD 1.3


73.0 
10.5 


pS 2.4* 


8 a 4.4** 3.8” 
3 


rp 
a 
5.8 
0.8 


“Dislike” Brix-acid Ratio 22 


Scale A (N = 
Mn 55.9 
SD 18.9 
Scale B (N = 
Mn 5.0 
SD 2.2 A 
Scale C (N 
Mn 49 5.4 
SD 1.3 1.0 


41.2 
10.7 


3.4 
1.2 


2.6 
1.1 





* Significant at the 5 per cent level.
** Significant at the 1 per cent level. 


Table 3

Descriptions of Canned Orange Juices that
Vary in Brix-acid Ratio

                                         Brix-acid ratio (12 °Brix)
Description                                 12        16        22
                                         Per cent  Per cent  Per cent
Too tart or sour                            39        15        11
Too sweet                                   11        21        28
Too thin or watery                          22        27        35
Too artificial                              25        31        30
Just the right sweetness                    35        44        38
Just the right tartness or sourness         18, 17 (one value illegible)
Does not taste like fresh orange
  juice, but still is pretty good           51        55        53
Tastes like fresh orange juice               1         9         7

Number                                     175       175       175

Note: Percentages add to more than 100 because
some Ss checked more than one descriptive item.


The descriptions of the three canned orange 
juices by all Ss, regardless of scale used, are 
presented in Table 3. The percentage of Ss 
who described the juices as being too tart or 
sour decreased from the 12 to the 22 Brix- 
acid ratio. The percentage of Ss describing a 
juice as too sweet increased from the 12 to 
the 22 Brix-acid ratio. “Too thin or watery” 
was most frequently given for the 22 Brix- 
acid ratio juice. Approximately 50 per cent 
of the Ss said that although these juices did 
not taste like fresh orange juice they still 
were “pretty good.” 

In Table 4 the descriptions have been ana- 
lyzed in terms of those “liking” or “disliking” 
the 12 and 22 Brix-acid ratio juices. Those 
who “disliked” the 12 Brix-acid ratio juice 
tended to describe it as too tart or sour and 
too artificial. The Ss who “disliked” the 22 
Brix-acid ratio juice tended to say it was too 
sweet, too thin or watery, and too artificial. 
Approximately 50 per cent of those who liked 
Table 4


Descriptions of Canned Orange Juices by “Liking” vs. “Disliking” Brix-acid Ratio 12
and Brix-acid Ratio 22 (12 °Brix)


                      12 Brix-acid ratio          22 Brix-acid ratio
Description           “Like”    “Dislike”         “Like”    “Dislike”





Too tart or sour 

Too sweet 

Too thin or watery 

Too artificial 

Just the right sweetness 

Just the right tartness or sourness 

Does not taste like fresh orange juice, 
but still is pretty good 

Tastes like fresh orange juice 


Number 


Per cent 
14 
38 
47 


Per cent Per cent 
56 8 
10 10 18 
15 28 23 
8 46 10 53 
51 10 57 17 
4 10 28 1 


Per cent 
23 


74 
13 


74 


17 
100 


78 100 





Note: Percentages add to more than 100 because some Ss checked more than one descriptive item. 


a juice said it had “just the right sweetness.” 
“Just the right tartness or sourness” was
more frequently used to describe the 12 and 
the 22 Brix-acid ratio among those who 
“liked” these respective juices. 


Discussion 


Test I can be viewed as three independent 
experiments on preferences for these canned 
orange juices; each experiment involving a 
different scale. The general pattern of the 
preference results was similar for each scale. 
Regardless of the scale, the means of the 
preference ratings, for all Ss, were not sig- 
nificantly different. It had been expected 
that the 12 Brix-acid ratio would be too tart 
and the 22 Brix-acid ratio too sweet, thus 
producing the highest mean preference rat- 
ings for the 16 Brix-acid ratio juice. This 
expectation was based upon the assumption 
that we would be dealing with a sample from 
one population. Taking the means for all Ss 
per juice at their face value one would con- 
clude that one juice was as likely to be pre- 
ferred as another. This, in turn, would raise 
the question of whether the Ss really could 
distinguish between the three juices in this 
method of single stimulus approach although 
prior comparative judgment experiments have 
shown that these juices are discriminable (3). 


The division of the Ss into “liking” or “dis- 
liking” a given juice gave evidence that, with 
respect to canned orange juices, there are two 
basic populations that we sampled. One 
population likes a tart juice; the other likes 
a sweet juice. Those who “liked” a tart 
juice gave it a relatively high score and gave 
a low score to the sweeter juice. The Ss who 
“disliked” the tart juice gave it a low score 
and assigned a relatively high score to the 
sweeter juice. Obviously, this phenomenon 
produced a cancelling-out effect on the means 
per juice for all Ss. This effect is particu- 
larly striking since the results were obtained 
with a method of single stimulus experimen- 
tal design. 
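
The cancelling-out effect described above can be made concrete with a small numerical sketch. All ratings below are hypothetical and were chosen only so that two groups with opposite preferences produce identical means when pooled; the group names and the helper `juice_means` are illustrative and do not come from the study.

```python
import statistics

# Hypothetical Scale B ratings (1 to 10), for illustration only.
# Each tuple is (rating of tart juice, rating of sweet juice) for one S.
tart_likers = [(8, 4), (9, 3), (7, 5), (8, 4)]
sweet_likers = [(4, 8), (3, 9), (5, 7), (4, 8)]


def juice_means(pairs):
    """Return (mean tart rating, mean sweet rating) for a list of Ss."""
    tart = statistics.mean(p[0] for p in pairs)
    sweet = statistics.mean(p[1] for p in pairs)
    return tart, sweet


pooled = juice_means(tart_likers + sweet_likers)
print("All Ss:      ", pooled)
print("Tart likers: ", juice_means(tart_likers))
print("Sweet likers:", juice_means(sweet_likers))
```

Taken at face value the pooled means suggest no preference, while each group taken separately shows a clear one. This is the same reading the “Like”-“Dislike” division gives for the actual data.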

That this phenomenon is no artifact is seen 
in its demonstration in Test I with three dif- 
ferent scales. Further evidence is seen in the 
replication of the experiment (Test III) after 
two months, using the 12 and 22 Brix-acid 
ratio juices. Once again, the difference be- 
tween the juices, for all Ss, was not signifi- 
cant. However, analysis in terms of “liking” 
and “disliking” showed that the Ss who 
“liked” one juice scored it relatively high in 
contrast to the score given the other juice. 

The data from the descriptive check-list 
show that the Ss were responding to the criti- 
cal variables in these juices (tartness-sweetness
and body or consistency). Furthermore,
the data from all Ss yield additional support 
for the conclusion that two populations were 
involved. The percentage describing a juice 
as too tart or sour decreased from the 12 
through 22 Brix-acid ratio juices. The re- 
verse was true for those calling these juices 
too sweet. When the descriptions were con- 
sidered in terms of “liking” or “disliking” 
one of these juices they revealed that the Ss 
were responding to the tart-sweet dichotomy. 

The Ss were asked whether they liked or- 
ange juice “somewhat on the tart side or 
somewhat on the sweet side.” Forty-seven 
per cent said they liked it tart, 46 per cent 
replied “sweet,” and 7 per cent volunteered 
the information that they liked it “medium,” 
“in-between,” etc. Although the distribution 
of these replies again supports the two-popu- 
lation concept they were not indicative of 
how the juices were scored. There was only 
low correlation between the replies to this 
question and the preference scores for the 
juices. This can only mean that the ques- 
tion does not locate those Ss who actually 
prefer tartness or sweetness under direct ex- 
perience with the juices. 

The question now becomes whether there 
was any difference in the efficiency of the 
three scales in revealing the preference pat- 
tern. It has been pointed out that the pref- 
erence pattern was similar for the three 
scales. Inspection of the t’s for the “Like”- 
“Dislike” data shows that Scale B tended to 
give higher significance values than the other 
two scales. The median t for Scale A was 
2.00, for Scale B was 3.04, and for Scale C 
was 2.51. 

The reproducibility test with the 12 Brix-acid
ratio juice did not produce significantly
different ratings on any scale. However, it 
should be noted that Scale B came closer to 
doing this than did the other two scales. 

There is an indication that as the Ss con- 
tinued to work with these juices the prefer- 
ence ratings tended to rise. In Test I the 
means for all Ss for the 12 and 22 Brix-acid 
ratio were 5.70 and 5.84, respectively. In 
Test III the respective means were 6.73 and 
6.17. This supports the prior finding that 
repeated experience with the juices, under
single stimulus conditions, produces generally 
higher ratings (5). In spite of this general 
increase in preference the “Like”-“Dislike” 
patterns still existed. 

On the basis of the results of this experi- 
ment it was decided to use Scale B in a 720 
household study of preferences for six canned 
orange juices that vary in Brix-acid ratio, 
with °Brix constant. Scale B seems to be 
somewhat more efficient in revealing prefer- 
ence patterns and has the advantage of mini- 
mizing language and intellectual difficulties. 


Summary 


1. Using a method of single stimulus de- 
sign three canned orange juices that varied in 
tartness-sweetness and in body or consistency 
were given preference ratings. Three differ- 
ent scales were used, each S working with 
only one scale. 

2. Under method of single stimulus condi- 
tions the Ss were able to respond to the vari- 
ables of tartness-sweetness and body or con- 
sistency. 

3. The results indicated that there are two 
populations with respect to preference for 
canned orange juice—one prefers a tart juice, 
the other a sweet one. 

4. A relatively unstructured scale, with 
only the ends of the continuum defined, 
tended to be most efficient. All three scales, 
however, revealed the same pattern of prefer- 
ence. 


Received December 12, 1953. 
Although engineers and human engineers
frequently recommend the use of inclined 
visual displays and panels on control con- 
soles, there is virtually no scientific evidence 
to show that this design practice has any 
measurable effect on operator efficiency (2). 
The present study was undertaken to dis- 
cover whether tilting the keyset now used by 
long-distance operators would have any effect 
on their keying performance. In view of the 
lack of experimental evidence in this area, 
we believe that our findings may be of gen- 
eral interest. 


Experimental Method 


Apparatus. The long-distance operator’s keyset
is a ten-button set with the numbers and letters
arranged in two vertical rows of five. A
third column contains two keys, the KP key,
which sets up the apparatus to receive a number
sequence, and the ST key, which clears the machine
of the sequence just keyed (see Figure 1).
The keyset is normally mounted on a horizontal
working surface about 13 inches back from the
front edge of this surface, and 9 inches to the
right of center. The experimental apparatus
shown in Figure 2 approximates the toll-operator’s
position in its essential dimensions. In
normal operation, the keyset is horizontal. In
the present study, the keyset was mounted on
hinges so that it could be inclined at eight angles
relative to the working surface: 0, 5, 10, 15, 20,
25, 30, and 40 degrees.

Fig. 1. Top view of a toll-operator’s keyset.

* The experiment reported here was done at the
Bell Telephone Laboratories.

A remotely-located recorder printed numbers 
corresponding to the ten number-letter keys of 
the keyset. 

The illumination of the experimental room was 
constant throughout the test at an adequate in- 
tensity. 

Materials. The stimuli for the keying task
were ten-place number and letter combinations of
the following form: 3 digits, space, 2 letters, 1
digit, space, 4 digits. In long-distance operation
the first three digits are the “code” to the distant
location, the two letters and the next digit
give the subscriber’s exchange, and the remaining
four digits the subscriber’s number. For these
tests, the numbers were obtained from a table
of random numbers. The letters were also selected
randomly from a special table which ensured
that all letter combinations appeared
equally often (except that the letters Q and Z
were never used). The stimuli were presented to
the subject in list form. Twenty-four different
lists, each containing 50 stimuli, were used in the
first part of the experiment (practice sessions);
256 different lists, each containing 100 stimuli,
were used in the second part (test sessions).
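
The form of the stimuli can be illustrated with a short generator. This is a sketch under stated assumptions: it draws from a seeded pseudo-random generator rather than from printed tables, it makes no attempt to balance the letter combinations as the special table did, and the names `make_stimulus` and `make_list` are hypothetical.

```python
import random
import string

# Letters available for the exchange name; Q and Z were never used.
LETTERS = [c for c in string.ascii_uppercase if c not in "QZ"]


def make_stimulus(rng):
    """Return one ten-place item: 3 digits, 2 letters and 1 digit, 4 digits."""
    code = "".join(rng.choice(string.digits) for _ in range(3))
    letters = "".join(rng.choice(LETTERS) for _ in range(2))
    exchange = letters + rng.choice(string.digits)
    number = "".join(rng.choice(string.digits) for _ in range(4))
    return f"{code} {exchange} {number}"


def make_list(length, seed):
    """Return a list of `length` stimuli from a seeded generator."""
    rng = random.Random(seed)
    return [make_stimulus(rng) for _ in range(length)]


practice_list = make_list(50, seed=1)   # practice lists held 50 stimuli
test_list = make_list(100, seed=2)      # test lists held 100 stimuli
```

Each item is twelve characters long, counting the two spaces, and contains ten keyed places.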

Experimental Design. The experiment was divided
into two consecutive parts, practice sessions
and test sessions, each part extending over eight
days. Two pairs of 8 by 8 Latin squares (four
in all) were used, the main-effect variables of
each being: (1) subjects; (2) days; and (3) inclinations
of the keyset. One pair of identical
Latin squares was assigned to the practice sessions;
the other pair of identical Latin squares
to the test sessions. One square of each pair was
assigned to 8 male subjects and the other to 8
female subjects.

Effect on Performance of Tilting the Toll-Operator’s Keyset

Fig. 2. A schematic illustration of the toll-operator’s
position in this experiment.
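
An 8 by 8 Latin square of the kind described in the experimental design can be sketched as follows. The particular squares used in the study are not given, so the cyclic construction below is an assumption chosen for simplicity; it is only one of many valid squares, and the function names are hypothetical.

```python
ANGLES = [0, 5, 10, 15, 20, 25, 30, 40]  # keyset inclinations in degrees


def cyclic_latin_square(treatments):
    """Return an n-by-n Latin square built by cyclic shifts.

    Row i is the treatment list rotated left by i places, so every
    treatment appears once in each row and once in each column.
    """
    n = len(treatments)
    return [[treatments[(row + col) % n] for col in range(n)]
            for row in range(n)]


def is_latin_square(square):
    """Check that each row and each column holds every treatment once."""
    n = len(square)
    symbols = set(square[0])
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all({square[r][c] for r in range(n)} == symbols
                  for c in range(n))
    return len(symbols) == n and rows_ok and cols_ok


# Rows stand for subjects, columns for days.
square = cyclic_latin_square(ANGLES)
```

Reading across a row gives the inclination one subject works at on each of the eight days; reading down a column shows that all eight inclinations are in use on every day.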

Procedure. Instructions read to each subject 
at the beginning of the first practice session cov- 
ered these essential points: 


1. Position of the subject with respect to the 
keyset. 

2. Technique of keying. (All keying was done 
with the first or second finger of the right 
hand.) 

3. The criterion. (Primary emphasis was placed
on accuracy.)

4. Procedure to follow when an error was discovered.


Because this last instruction required the subject 
to stop and rekey the entire number whenever he 
thought he made an error, the time and error 
measurements are not entirely independent. 
Later on, however, we will see that this is not 
an important consideration. 

On each day of the practice sessions, all sub- 
jects keyed three number lists with a 5-minute 
rest between lists. On each day of the test ses- 
sions, all subjects keyed two number lists with a 
10-minute rest between lists. 

Subjects. Sixteen subjects, eight male and 
eight female, participated in this study. Their 
ages were between 18 and 35. No subject had 
previous experience on this keyset. One female 
subject did not participate on the last day of the 
test sessions. 


Results 


All data expressed in the following graphs 
have been computed from individual error 
and time values. An individual error value 
is the percentage of incorrect keyings made 
by a subject, based on keying 150 numbers 
in a practice session or 200 numbers in a test 
session. An individual time value is the to- 
tal time required for a subject to key the 
three number lists in a practice session or the 
two number lists in a test session. 

For this kind of experimental design, an 
analysis of variance is usually employed to 
evaluate the data. Although such analyses 
were carried out in the present study, the re- 
sults are much more clearly described in the 
accompanying graphs. All graphs depict 
three statistical measures: (1) the arithmetic 
mean; (2) the mean plus and minus one 
standard deviation; and (3) the total range 
of values. 

Effect of Tilt. Figures 3 and 4 clearly
demonstrate that inclination of the keyset has
virtually no effect on keying performance,
either in terms of error or time. Figure 3,
for example, shows that the averages for the
test sessions are within a small range: 2.5 to
3.7 per cent. Moreover, there is no evidence
of any systematic trend in the mean values
as a function of keyset inclination. A straight
line with zero slope appears to fit these data
adequately. It is not likely that the data are
appreciably affected by the fact that one subject
was not tested at the 25-degree inclination.

Fig. 3. The data at each inclination are based on
the percentage of errors made by 16 subjects each of
whom keyed 150 numbers (practice sessions) or 200
numbers (test sessions). (At 25° for the test sessions
there were only 15 subjects.) The short horizontal
line is the mean, the solid vertical bar the
mean plus and minus one standard deviation, the
thin vertical bar the range of individual error percentages.

Fig. 4. These time data correspond to the error
data of Figure 3. The basic datum is the total time
required by each subject to key a set of numbers.
The mean, standard deviation, and range are represented
as in Figure 3. The arrow at 25° shows the
mean estimated for 16 subjects (see text).

Figure 4 also shows that the average times
lie within a small range. For the practice
sessions this range is 1.5 minutes (27.3 to
28.8 minutes). For the test sessions the
range is 1.1 minutes (31.1 to 32.3 minutes)
provided that the estimated value for 25 degrees
is used. Since the subject who was not
tested at 25 degrees (Subject E, Figure 8)
had the longest average keying time, the
mean for 25 degrees is undoubtedly too low
because of this omission. The arrow in Figure
4 shows the estimated value for the mean
on the assumption that Subject E had turned
in a value equal to her average keying time.

Fig. 5. The error data for each day are based on
the performance of 16 subjects, except that on the
last day there were only 15 subjects. The mean,
standard deviation, and range are represented as in
Figure 3. The curves through successive means were
drawn by inspection.
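
The estimate for the 25-degree mean can be sketched as a simple substitution of the missing subject's own average. The fifteen observed times below are hypothetical. The value used for Subject E is the upper end of the range of individual averages reported for Figure 8, taken here as her average because she is described as the slowest subject; the function name is illustrative.

```python
import statistics


def mean_with_imputed(observed, missing_subject_average):
    """Mean for one condition after filling in one missing subject.

    observed holds the times of the subjects who were tested; the
    missing subject is assumed to have produced a time equal to
    her average over the sessions she did complete.
    """
    return statistics.mean(list(observed) + [missing_subject_average])


# Hypothetical keying times in minutes for 15 subjects at one inclination.
times_25_degrees = [30.0, 31.5, 29.0, 33.0, 28.5, 32.0, 30.5, 31.0,
                    29.5, 34.0, 27.5, 32.5, 30.0, 31.0, 29.5]
subject_e_average = 39.7  # assumed: the longest individual average reported

observed_mean = statistics.mean(times_25_degrees)
estimated_mean = mean_with_imputed(times_25_degrees, subject_e_average)
```

Because the omitted subject is the slowest, the estimated mean for 16 subjects is necessarily higher than the mean observed for 15.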

Learning. Figures 5 and 6 show the course
of learning in terms of errors and time, respectively.
Both show a large and significant
decrease due to learning throughout the
first eight days of test. Errors do not show
a significant decline in the second eight days,
although the keying times do. It is apparent,
therefore, that learning was not complete even
after 16 days of test. In this experiment
the effects attributable to learning are much
greater than those produced by variations in
the inclination of the keyset.

Fig. 6. These time data correspond to the error
data in Figure 5. The mean, standard deviation,
and range are represented as in Figure 3. The
curves through successive means were drawn by inspection.

Fig. 7. These error data show each subject’s performance
for the eight test sessions. (The data for
E are based on only seven test sessions.) The mean,
standard deviation, and range are represented as in
Figure 3.

Individual Differences. Figures 7 and 8
are plots of individual keying performances
in terms of error and time for only the test
sessions. These graphs show the most important
source of variance in our experiment.
The averages in Figure 7 cover a range from
0.5 to 6.6 per cent. In Figure 8 the range is
from 24.8 to 39.7 minutes.

Rank-order correlation coefficients between
average errors and average times for the test
sessions were computed for the male and female
subjects separately. For the female
subjects, the coefficient was +0.07; for the
male subjects, -0.76. We have no explanation
for the difference between the magnitudes
of these two correlation coefficients.

Fig. 8. These time data correspond to the error
data in Figure 7. The mean, standard deviation,
and range are represented as in Figure 3.
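
A rank-order correlation of the kind reported for the two groups of eight subjects can be computed as sketched below. The text does not say which coefficient was used; Spearman's rho, in its rank-difference form, is assumed here. The error and time values are hypothetical and the function names are illustrative.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; ties get the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def rank_order_correlation(x, y):
    """Spearman's rho from rank differences: 1 - 6*sum(d^2) / (n(n^2 - 1)).

    The formula is exact only when there are no tied ranks.
    """
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical averages for eight subjects, for illustration only.
errors = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]   # per cent
times = [38, 36, 33, 34, 30, 31, 27, 25]            # minutes
rho = rank_order_correlation(errors, times)
```

A negative coefficient, as in this made-up example and as reported for the male subjects, means that the subjects who keyed faster tended to make more errors.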


Discussion 


In the specific work situation of the pres- 
ent study, we have found performance to be 
unaffected by the inclination of the working 
surface. However, spontaneous comments 
from all of our subjects indicated that they 
preferred an inclined keyset surface to a hori- 
zontal one. Furthermore, about half of the 
subjects expressed a preference for a keyset 
inclination between 15 and 25 degrees. 

These subjective preferences, as well as the 
quantitative data, are in agreement with an- 
other specific investigation that was concerned 
with speed and accuracy of target indication 
on a radar which was mounted at various in- 
clinations (1). Since the nature of the tasks 
in these two situations differs so radically, 
the agreement between the two sets of re- 
sults suggests that we can perhaps apply the 
findings to other work situations. If a work- 
ing surface is clearly visible to the operator 
and if it is within easy reach, inclining the 
work surface will probably not result in any 
measurable effect on performance. People 
seem to like inclined surfaces better than 
horizontal ones, but we have no way of evalu- 
ating the importance of such preferences. 

Many of the standard deviations in Figures
3 through 8 are large because the data
are not homogeneous, i.e., they include sev- 
eral sources of variance. For example, the 
standard deviations for each keyset inclina- 
tion in Figures 3 and 4 include the differences 
between subjects and the differences between 
days, both of which are large. In Figures 5 
and 6, the standard deviations include differ- 
ences between subjects and between inclina- 
tions. In this case the standard deviations 
are smaller because, as we have seen, varia- 
tions produced by keyset inclination are small. 
In Figures 7 and 8, the standard deviations 
are small because the variations attributable 
to inclinations and days (for the test sessions
only) are also small. 

Earlier we noted that the time and error 
scores are not independent. If this were an 
appreciable factor in this experiment, we
should expect the two values to be positively 
correlated. Actually they are not. In addi- 
tion, we should note that in the test sessions 
there were only a few errors committed and, 
of those made, less than one in three was de- 
tected and rekeyed by the subject. All in 
all, therefore, we do not believe that this is 
an important consideration in these data. 


Summary 


The present experiment investigated two 
measures of keying performance, accuracy 
and time, as a function of inclination of the 
keyset. The keyset was inclined at eight 
angles, 0, 5, 10, 15, 20, 25, 30, and 40 de- 
grees, relative to the working surface. 

The test was divided into two parts, prac- 
tice sessions and test sessions. The subject’s 
task was to key lists of ten-place number and 
letter combinations. Eight by eight Latin 
squares were used, the principal variables being
subjects, days, and inclinations of the keyset.
The results clearly demonstrate that:


1. Keying accuracy and keying time are 
independent of the inclination of the keyset. 

2. Both accuracy and speed increased sig- 
nificantly throughout the sixteen days of test. 

3. The greatest source of variation in this 
experiment is that produced by differences 
between subjects. 
The Use of a Joy-Stick in Making Settings on a Simulated Scope Face *


William Leroy Jenkins and A. Charles Karr 
Lehigh University 


An earlier study¹ reported the use of levers
in making settings on a linear scale. The 
most important variable proved to be the 
ratio between the movement of the lever tip 
(L) and the movement of the pointer (P). 
An L/P ratio of approximately three was 
found to be optimal. The current investiga- 
tion extends the problem into two dimensions, 
using a joy-stick to set a cursor on a simu- 
lated scope face. 

An operational diagram of the apparatus 
is shown in Figure 1. A vertical twelve-inch 
aluminum disc, with its center at approxi- 
mately eye-level and about 24” from the sub- 
ject’s eyes, simulates a scope face. Seven 
quarter-inch circular lucite inserts are spaced 
around a ten-inch diameter, six inserts around 
a seven-inch diameter, and four around a 
three-inch diameter. The cursor (a brass 
disc .150” in diameter) is controlled by a joy- 
stick placed between the subject’s knees with 
its tip about six inches below the edge of the 
simulated scope. 

Right-left components of the joy-stick move- 
ment are transmitted through the lower shafts 
to the small pulley and then to the upper 
shaft, causing the long cylinder to move right 
and left across the simulated scope face. 
Various ratios of movement between joy-stick
and upper shaft are obtained by shifting
the belt attachments along the bar at the
end of the upper shaft.

Front-back components of the joy-stick 
movement operate a hydraulic pump that 
serves to move the piston up and down in the 
long cylinder. Ratios between movement of 
the joy-stick and movement of the piston are 
changed by sliding the attachment of the 
hydraulic pump up or down on the joy-stick. 

* This research was executed under Contract AF 
18(600)-24 between the Institute of Research, Le- 
high University, and the USAF Wright Air Develop- 
ment Center, Aero Medical Laboratory, Wright-Pat- 
terson Air Force Base, Dayton, Ohio. 

1 Jenkins, W. L. and Olson, M. W. The use of
levers in making settings on a linear scale. J. appl.
Psychol., 1952, 36, 269-271. Also USAF Technical
Report No. 6563.


Since the viscous friction of the right-left 
system is less than that of the front-back 
system, it is necessary to equalize the kines- 
thetic feel by adding viscous friction to the 
right-left system. This is done by adjusting 
a Prony brake, liberally coated with graphite 
lubricant, which adds a viscous drag, until the 
right-left viscous friction seems equal to the 
front-back friction. 

The cursor and scoring mechanism are 
mounted at the top of the piston that moves 
up and down in the long cylinder. The scoring
mechanism operates as follows: When the
subject has completed a setting he pushes 
a switch which discharges a condenser into 
a small electromagnet. The electromagnet 
moves the lucite strip bearing the brass 
cursor, so that the brass disc comes in con- 
tact with the scope face for a fraction of a 
second. If the cursor touches only the lucite 
insert, no electrical contact is made and a 
green light glows. If the cursor is not en- 
tirely within the confines of the insert, elec- 
trical contact is made between the brass disc 
and the aluminum scope face, lighting a red 
light to indicate a mis-setting. 
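
The tolerance implied by this scoring arrangement can be worked out from the dimensions given earlier. The following is a minimal sketch, assuming both cursor and insert are perfect circles and that exact edge contact still counts as correct; the function name `setting_is_correct` is hypothetical.

```python
import math

INSERT_DIAMETER = 0.250  # inches: quarter-inch lucite insert
CURSOR_DIAMETER = 0.150  # inches: brass cursor disc


def setting_is_correct(cursor_xy, insert_xy):
    """Return True when the cursor lies entirely within the insert.

    The cursor is wholly inside the insert when the distance between
    centres does not exceed the difference of the two radii.
    """
    tolerance = (INSERT_DIAMETER - CURSOR_DIAMETER) / 2  # about 0.05 inch
    offset = math.dist(cursor_xy, insert_xy)
    return offset <= tolerance


print(setting_is_correct((0.03, 0.00), (0.0, 0.0)))  # green light
print(setting_is_correct((0.06, 0.00), (0.0, 0.0)))  # red light
```

On this reading, the centre of the cursor must come within roughly one-twentieth of an inch of the centre of the insert for a setting to be scored as correct.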


Procedure 


The procedure for a single setting is as 
follows: Following a ready signal, the ex- 
perimenter moves a switch that simultane- 
ously lights one of the inserts and starts the 
timing clock. The subject moves the joy- 
stick to bring the cursor onto the lighted in- 
sert, and then pushes a button that simul- 
taneously operates the scoring mechanism and 
stops the timing clock. The elapsed time on 
the clock shows the setting time, and a green 
or a red light indicates whether the setting is 
correct. 


Results 


For clarity, the results will be described in
five parts, paralleling the chronological order
of the experiments.

L/P Ratio in General. By combining various
lever lengths and apparatus settings,
twelve L/P ratios between 1.0 and 3.9 were 
tested, with nine target positions. Each of 
20 subjects made 20 settings at each com- 
bination of L/P ratio and target position. 

Table 1 shows for the twelve L/P ratios 
the mean setting time, variability, and mis- 
settings. In all three respects, ratios of 2.0 
and above appear to be clearly more favor- 
able than the ratios of 1.7 and under. Al- 
though the highest L/P ratios are obtained 
with the longer levers, each of the longer 
levers is also represented among the lower 
(unfavorable) L/P ratios. 

When the data are re-analyzed to compare 
the favorable ratios (2.0 and up) with the 
unfavorable ratios (1.7 and down) for each 
of the 20 subjects individually, in all 20 sub- 
jects, there is a saving in setting time with 
the favorable ratios. In 19 of the 20 sub- 
jects, there is likewise a decrease in vari- 
ability, and in 18 of the 20 subjects a de- 
crease in mis-settings. 

When the data are analyzed according to 
the nine target positions, at each target po- 
sition there is a saving in setting time, a de- 
crease in variability, and a reduction in mis- 
settings with the favorable ratios. 
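
The variability measure used in the tables (the rms of the subjects' standard deviations) can be sketched as follows. This is an illustration only: the function name `rms_of_sds` is hypothetical, and the use of the sample standard deviation for each subject is an assumption, since the report does not say which form was computed.

```python
import math
import statistics


def rms_of_sds(settings_by_subject):
    """Root mean square of the per-subject standard deviations.

    settings_by_subject: one list of setting times per subject.
    Assumption: each subject's spread is the sample standard deviation.
    """
    sds = [statistics.stdev(times) for times in settings_by_subject]
    return math.sqrt(sum(sd ** 2 for sd in sds) / len(sds))


# Two hypothetical subjects with standard deviations of 1.0 and 2.0 sec.
example = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
print(rms_of_sds(example))
```

Squaring before averaging gives the more erratic subjects more weight than a simple mean of the standard deviations would.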

Fig. 1. Operational Diagram of Joy-Stick Apparatus. (Labeled parts include the joy-stick, the adjustable pump attachment, and the cursor and scoring mechanism.)

Table 1

Each Value is the Mean of 3,600 Settings
(20 Subjects × 9 Target Positions × 20 Settings)

L/P      Lever     Mean Setting    Mean Variability    Mis-
Ratio    Length    Time            (rms of σ’s)*       settings
1.0      Yad       2.58 sec.       0.80 sec.           7.4%
1.0      18”       2.48 sec.       0.70 sec.           6.6%
1.4      24”       2.20 sec.       0.54 sec.           4.8%
1.6      2”        2.23 sec.       0.58 sec.           5.0%
1.7      30”       2.18 sec.       0.54 sec.           3.9%
2.0      18”       2.02 sec.       0.46 sec.           4.3%
2.0      24”       2.02 sec.       0.44 sec.           4.8%
2.5      30”       1.99 sec.       0.40 sec.           3.7%
a4       24”       1.93 sec.       0.37 sec.           3.3%
3.1      24”       1.92 sec.       0.35 sec.           4.0%
3.4      30”       1.94 sec.       0.33 sec.           3.2%
3.9      30”       1.95 sec.       0.34 sec.           3.5%

*rms is the square root of the mean of the squares
of the standard deviations, i.e.,
$\mathrm{rms} = \sqrt{\dfrac{\sum \sigma^{2}}{n}}$

Different Lever Lengths with Favorable
L/P Ratios. The aim of the next set of experiments
was two-fold: to determine whether lever-length as
such was significant within the favorable L/P ratios,
and to see whether there was any indication of an
optimal ratio within the favorable region. Accordingly,
lever lengths of 12, 18, 24, and 30 inches were employed
with apparatus settings to give L/P ratios of 2.0, 2.5,
and 3.0 (except that it was not possible to reach an
L/P ratio of 3.0 with the 12” lever in the present
apparatus). Target positions were restricted to the four
on the three-inch diameter and the six on the seven-inch
diameter. Each of 19 subjects made 20 settings at each of
10 positions using each of the 11 lever-ratio combinations.
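
The count of 11 lever-ratio combinations, and the number of settings behind each tabled mean, follow from the design just described. A minimal sketch of the arithmetic (variable names are illustrative only):

```python
from itertools import product

lever_lengths = [12, 18, 24, 30]  # inches
lp_ratios = [2.0, 2.5, 3.0]

# Every lever with every ratio, except the one combination the
# apparatus could not reach (12-inch lever at an L/P ratio of 3.0).
combinations = [
    (lever, ratio)
    for lever, ratio in product(lever_lengths, lp_ratios)
    if not (lever == 12 and ratio == 3.0)
]

subjects, positions, settings = 19, 10, 20
settings_per_combination = subjects * positions * settings

print(len(combinations))
print(settings_per_combination)
```

Four levers at three ratios would give twelve combinations; removing the unreachable one leaves eleven, each based on 19 × 10 × 20 = 3,800 settings.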

Table 2 shows the findings. It is evident 
that lever-length as such plays little or no 
part in the outcome. However, the L/P 
ratio of 2.5 is slightly superior to 2.0 in set- 
ting time, variability, and mis-settings, and 
inferior to 3.0 only in mis-settings. For con- 
venience the L/P ratio of 2.5, being lower, 
will be considered optimal. 

Table 2

Lever Length and Optimal L/P Ratio

Note: Each value is the mean of 3,800 settings (19
subjects × 10 target positions × 20 settings).

L/P      Lever     Mean Setting    Mean Variability    Mis-
Ratio    Length    Time            (rms of σ’s)        settings
2.0      12”       1.71 sec.       0.32 sec.           2.2%
         18”       1.66 sec.       0.33 sec.           2.4%
         24”       1.64 sec.       0.31 sec.           2.4%
         30”       1.72 sec.       0.35 sec.           2.2%

2.5      12”       1.63 sec.       0.28 sec.           2.2%
         18”       1.59 sec.       0.26 sec.           2.1%
         24”       1.59 sec.       0.27 sec.           2.3%
         30”       1.63 sec.       0.30 sec.           2.0%

3.0      12”       -               -                   -
         18”       1.60 sec.       0.25 sec.           1.7%
         24”       1.57 sec.       0.26 sec.           1.4%
         30”       1.59 sec.       0.25 sec.           1.4%

Starting Positions. Up to this point the
starting position of the cursor was always
at the bottom of the simulated scope face.
The question was raised whether this was the
best starting position. In the next series of
experiments, five starting positions were used:
top, bottom, right, left, and center of the
ten-inch diameter circle on which the outer seven
inserts were located. Each of 12 subjects
made 20 settings at each of 17 target positions
(including the seven on the ten-inch
diameter), using the L/P ratio of 2.5, and
starting from each of the five positions.
Average travel distance was thus the same for
all starting positions.

Table 3

Influence of Starting Position on Performance
L/P ratio = 2.5; Lever = 24”

Note: Each value is the mean of 4,080 settings (12
subjects × 17 target positions × 20 settings).

Starting     Mean Setting    Mean Variability    Mis-
Position     Time            (rms of σ’s)        settings
Top          2.08 sec.       0.38 sec.           2.2%
Bottom       1.99 sec.       0.39 sec.           2.9%
Right        2.01 sec.       0.38 sec.           1.9%
Left         2.01 sec.       0.40 sec.           2.6%
Center       1.88 sec.       0.36 sec.           3.2%

Table 3 shows the results. In terms of 
mean setting time and variability the center 
position is slightly superior. On the other 
hand, in percentage of mis-settings the center 
starting position is the worst of the five. 

In another analysis of the same data, the 
best starting position for each subject was de- 
termined in terms of each of the three cri- 
teria. In setting time, the center position is 
best for eight out of twelve subjects. In 
variability, the center position is best for five 
subjects. But in mis-settings no position 
stands out as being best. In overall view, it 
seems that starting position is relatively un- 
important. 

Reversed Front-Back Operation. In nor- 
mal operation, the cursor moved upward 
when the joy-stick was pushed away from the 
subject and downward when the joy-stick was 
pulled toward the subject. A question was 
raised concerning the effect on the optimal 
L/P ratio if this operation was reversed so 
that the cursor moved upward when the joy- 
stick was pulled toward the subject and vice 
versa. 

Each of 17 subjects made 10 settings at
each of the 10 inner target positions with
each of five ratios (Trials 1-10), using the
normal direction of operation. He then made
40 settings at each of the 10 target positions
with each of the five ratios (Trials 11-20, 21-30,
31-40, and 41-50), using the reversed direction
of operation. Finally he made another 10 settings
at each of the 10 target positions with each of
the five ratios (Trials 51-60), again using the
normal direction of operation.

Table 4 shows by blocks of 10 trials the 
results in mean setting time, mean variability, 
and mis-settings. Two points can be noted: 
First, by the end of 40 trials with reversed 
operation (comprising 2,000 individual set- 
tings per subject) performance with reversed 
operation approached performance with direct 
operation, indicating that the subjects learned 
to handle what they all called an unnatural 
relationship of joy-stick and cursor movement.
Second, for both conditions, an L/P
ratio of 2.5 is the lowest that can be called 
optimal. 

Subject’s Switch. In all the studies just 
described, the subject’s switch was held in 
the hand that was not operating the joy- 
stick. A question was raised as to whether 
other types of switching would affect the per- 
formance. Two other types of switches were 
added: A push-button was located at the top 
of the upper end of the joy-stick, operating 
with very light pressure. A foot-pedal, with 
enough spring resistance to bear the weight 
of the subject’s foot, was placed at a con- 
venient position on the floor. 


Table 4

Performance with Reversed Direction of Operation of Joy-stick and Cursor

Note: Each value is the mean of 1,700 settings
(17 subjects × 10 target positions × 10 settings).
Lever Length 24”

Mean Setting Time (seconds)
                                     L/P Ratios
Trial Nos.   Operation    1.4      1.9      2.2      2.5      3.0
1-10         (Direct)     (1.81)   (1.65)   (1.68)   (1.66)   (1.64)
11-20        Reversed     2.16     2.05     2.09     2.02     2.03
21-30        Reversed     2.03     1.89     1.86     1.91     1.88
31-40        Reversed     1.95     1.82     1.80     1.78     1.80
41-50        Reversed     1.84     1.74     1.72     1.69     1.72
51-60        (Direct)     (1.72)   (1.58)   (1.58)   (1.56)   (1.52)

Mean Variability (rms of σ’s in sec.)
                                     L/P Ratios
Trial Nos.   Operation    1.4      1.9      2.2      2.5      3.0
1-10         (Direct)     (0.38)   (0.29)   (0.30)   (0.28)   (0.25)
11-20        Reversed     0.49     0.44     0.48     0.43     0.41
21-30        Reversed     0.42     0.34     0.34     0.35     0.33
31-40        Reversed     0.38     0.33     0.32     0.31     0.32
41-50        Reversed     0.36     0.30     0.30     0.31     0.29
51-60        (Direct)     (0.32)   (0.25)   (0.25)   (0.25)   (0.23)

Mis-settings (percentage)
                                     L/P Ratios
Trial Nos.   Operation    1.4      1.9      2.2      2.5      3.0
1-10         (Direct)     (11.5)   (7.9)    (7.2)    (6.7)    (6.6)
11-20        Reversed     10.5     7.2      8.1      7.1      6.1
21-30        Reversed     8.6      6.9      7.6      4.8      7.0
31-40        Reversed     6.9      5.1      5.4      5.0      5.5
41-50        Reversed     6.9      6.2      5.4      4.9      5.4
51-60        (Direct)     (6.9)    (5.8)    (4.7)    (4.8)    (3.8)
Table 5

Influence of Type of Switch on Performance

Note: Each figure is the mean of 5,100 settings (10
subjects × 17 positions × 30 settings).
Ratio 2.5, Lever Length 24”

Type of          Mean Setting    Mean Variability    Mis-
Switch           Time            (rms of σ’s)        settings
Other hand       1.46 sec.       0.19 sec.           8.9%
Joy-stick tip    1.47 sec.       0.19 sec.           10.1%
Foot pedal       1.47 sec.       0.19 sec.           8.2%





Each of 10 subjects made 30 settings at 
each of 17 target positions with each of the 
three types of switches. A 24” lever and an 
L/P ratio of 2.5 were employed throughout. 

Table 5 shows the results. It is evident 
that all three types of switches are about 
equal in terms of mean setting time, mean 
variability, and mis-settings. Apparently any 
one of these three types of switches, which- 
ever is most convenient, can be used without 
affecting performance. 


Summary 


A series of experiments was performed to 
determine the significance of certain variables 
in the use of a joy-stick to make settings in 
two dimensions on a simulated scope face to 
a relatively coarse tolerance. 
The most significant factor turns out to be
the ratio between the movement of the joy- 
stick tip and the movement of the cursor. 
The lowest ratio that can be considered opti- 
mal is about two-and-a-half. That is, the 
tip of the joy-stick should move two-and-a- 
half times as fast as the cursor. 

Other variables proved to be relatively un- 
important. Joy-stick lengths of 12”, 18”, 24”, 
and 30” are equally effective. Starting posi- 
tion (top, bottom, right, left, or center of the 
scope) makes little difference in the overall 
results. Reversed operation (cursor moving 
down when stick is pushed away from the 
operator) is slower but the optimal ratio is 
the same. Finally, results are not affected by 
the position of the subject’s switch, whether 
it is operated by the hand not holding the 
joy-stick, by a foot-pedal, or by the same 
hand that moves the joy-stick. 

It should be emphasized that these results 
were obtained in a situation where the move- 
ment of the joy-stick is translated directly 
into movement of the cursor. The present 
type of apparatus does not permit making 
tests of a similar nature with joy-stick con- 
trols where the movement of the pointer is 
determined by pressure rather than by ex- 
tent of movement of the joy-stick. 


Received November 27, 1953. 
Figure and Ground in a Two Dimensional Display *


R. C. Browne 
In an aircraft the subjective feelings of
passenger or pilot are little guide to the atti- 
tude which the machine assumes, and they 
are even less guide when it is flying in the 
dark or in cloud when there is nothing to 
which external reference can be made. An 
indicator (Figure 1) was therefore developed 
to provide the pilot with a visible display of 
how the attitude of his aircraft varies in re- 
lation to an artificial horizon which is gyro- 
scopically stabilized. It shows in two dimen- 
sions whether the aircraft is climbing, diving, 
or banking and also, on a scale, the amount 
of bank, in degrees from the horizontal. This 
display provides a “figure and ground” prob- 
lem in that it is the horizon which apparently 
moves and not the aircraft. This does not, of 
course, accord with the facts, although it does 
with the appearance of the horizon as seen 
from the aircraft. Because of this, it was 
thought likely that air pilots often made 
wrong control movements and so increased 
the departure of their machine from the 
straight and level attitude. To meet these 
objections a new display was designed in 
which a diagrammatic aircraft moved in ref- 
erence to a stationary horizon. The problem 
in its crude ad hoc form was, therefore, to 
decide which of these two displays was the 
more suitable. 

An initial examination of the two displays 
showed that they differed in three respects: 

1. In the old method (Figure 1D) the 
“figure” or miniature aircraft is stationary, 
whereas in the new (Figure 1A) it moves 
against a “ground” composed of a horizon
which is still. 

2. The old display is provided with a scale 
and pointer which shows how many degrees 
the aircraft is banking to one side or the 
other, but in the reversed sense. 


* Acknowledgments are due to the Medical Direc- 
torate, British Royal Air Force, for permission to 
publish this paper, and to Mr. H. Campbell, B.A., 
F.S.S. for statistical advice. 


3. The old instrument is less heavily 
“damped” than the new; in other words, the 
new display takes rather longer to come to 
rest after a given deflection. 


Method 


The classical method of studying a display 
problem is with a tachistoscope. But, on the 
other hand, where machinery is controlled in re- 
sponse to alterations in an indicator (as in the 
present study) and some movement in a control 
system has to be made, it is perhaps better to 
assess the different displays in a comparable way 
by requiring the experimental subjects to make 
control movements in response to changes in 
them, and to measure the speed and accuracy 
with which they do so. 

A standard instrument flying trainer was, there- 
fore, used as the machine to be controlled, and 
it was fitted with a recording apparatus which 
integrated the speed and accuracy with which 
deflections from the straight and level attitude 
were corrected. It gave a numerical score every 
two minutes. The test lasted for eleven minutes 
which allowed time for four such scores to be 
made and noted. The attitude of the machine in 
the test was made to change quickly in a cyclical 
fashion which repeated itself every eighteen sec- 
onds. The task before the subjects of the ex- 
periment was to correct the changes in attitude 
which were conveyed by one or other of the two 
indicators. The hood of the trainer was shut, so 
that no fixed external reference point could be 
seen, and it was so arranged that there were no 
turning movements. A number of cadets chosen 
at random from a large group who had already 
been selected to be air pilots and who were, 
therefore, quite homogeneous, were the subjects 
of the experiment. But they were at a stage in
training when they had had no experience with attitude
display indicators. In this way, bias due to
familiarity with either display was avoided. Every
man received a comparable explanation, and
was allowed to practice until he could just do the
test without damaging the apparatus. The test
was kept short, eleven minutes, to avoid fatigue
and fluctuating levels of attention.

Fig. 1. The two displays. The New (A) is on the
left and the Old (D) on the right.

Table 1

The Numbers in Each Group of Subjects and the
Order in Which They Were Tested on
the Various Displays

Group      No. of      Order      
Letter     Subjects    of Test    Display
a          20          First      D
b          20          First      A
c          10          First      D
           10          Second     A
d          10          First      A
           10          Second     D
e          20          First      C
f          20          First      B

Trained Air Pilots
g          10          First      D
           10          Second     A
h          10          First      A
           10          Second     D

The experiment was divided into five parts as 
shown in Table 1. 

1. Two groups of 20 subjects (a) and (b) 
were chosen. Group (a) was tested on the old 
indicator (D) and Group (b) on the new indi- 
cator (A). 

2. Twenty new subjects were chosen and divided
into two groups of ten (c) and (d).
Group (c) was first tested on the old indicator 
(D) and then on the new indicator (A). Group 
(d) carried out the same two tests in the reverse 
order. 

3. The old indicator (D) was partially covered 
with black paper to make it comparable to (A) 
in every respect except the figure and ground re- 
lation and the degree of damping. A new group 
of 20 subjects (e) was tested on this display (C). 

4. The display (C) was further modified so 
that the damping was comparable to (A) and an- 
other group of 20 subjects (f) was tested on this 
new display (B). (B) now resembled (A) in 
every respect except the figure and ground rela- 
tion. 

5. Two groups of ten experienced pilots (g) 
and (h) who had trained on the old display (D) 
were chosen. Group (g) was first tested on the 
old indicator (D) and afterwards on the new 
indicator (A). Group (h) carried out the same 
two tests in the reverse order. For this experiment
the damping was made comparable on the
two displays. 


Results 


Figure 2 plots the means of the four scores 
for every subject for the roll or side-to-side 
movements, and Figure 3 shows similar scores 
for the pitch or fore and aft movements. In 
parts 1 and 2 of the experiment 30 subjects 
(groups a and c, Table 1) had their first test 
on the old display (D) and another compa- 
rable 30 (groups b and d) on the new (A). 
It was found that 5.0 fewer errors were made
in roll on (A) than on (D), which seems unlikely
to be due to chance, since t = 2.75 and
P = 1 in 100. In pitch the difference is
smaller (1.5 errors) as, indeed, were the
disturbances in attitude to be corrected. But
here too, fewer errors are made with the new 
display (A). Taken alone, this might be due 
to chance, but the difference is in the same 
direction as the difference in roll which lends, 
therefore, a certain weight to it. The instruc- 
tion times needed before the subjects were fit 
to start the test and their preferences for the 
two displays, are shown in Table 2. These 
two criteria were measured for 20 of the 30 
men in each group, and show that significantly 
less instruction time was needed in the case 
of the new display (17 compared to 23.5 min- 
utes) which was also subjectively preferred 
by between six and seven times as many of 
the men (33 compared to 5) who were tested 
upon it. 

Table 2

The Length of Instruction Time Needed Before the
Test Could be Started and the Subjective
Preferences for the Two Displays

                                              Display
                                       Old      New      Neither
Instruction Time (minutes)
  Mean                                 23.5     17.0
  Difference                           6.5 ± 1.84
Number of Men with Preference for      5        33

The results with displays (C) and (B) in
roll with fresh groups of subjects fall into
intermediate positions between the other two
as, indeed, did the displays themselves. In
pitch (Figure 3), however, there was little to
choose between the results given by the
different designs.

The relations between the means and stand- 
ard deviations of these figures on the four 
types of display are of some interest, and 
they are shown in Table 3 and Figure 4. In 
the roll dimension, as the display becomes 
easier to interpret through the sequence D, 
C, B, and A, and errors fall from 29.7 to 24.7, 
so the scatter between subjects falls also from 
8.5 to 5.1. But the scatter within a given 
subject’s performance remains much more 
constant at between 3.3 and 4.3 errors. The 
figures for pitch demonstrate the same trend 
less markedly. As the test becomes harder it 
magnifies the individual differences, but it 
does not appear to make the performance of 
a single man more erratic. 
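
As a rough consistency check on the roll comparison reported earlier (t = 2.75), the t statistic can be recomputed from the means and between-men standard deviations in Table 3. This is a sketch only. It assumes that the Table 3 figures for displays (D) and (A) describe the same two groups of 30 subjects used in that comparison, which the text does not state outright; the function name `two_sample_t` is hypothetical.

```python
import math


def two_sample_t(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """t statistic for the difference between two independent group means.

    With equal group sizes the pooled and unpooled standard errors
    coincide, so no choice between them is needed here.
    """
    standard_error = math.sqrt(sd_1 ** 2 / n_1 + sd_2 ** 2 / n_2)
    return (mean_1 - mean_2) / standard_error


# Roll errors: old display (D) against new display (A), 30 men each
# (group sizes are an assumption, as explained above).
t_roll = two_sample_t(29.7, 8.5, 30, 24.7, 5.1, 30)
print(round(t_roll, 2))
```

Under these assumptions the statistic comes out at about 2.76, in close agreement with the reported value of 2.75.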

Table 3

The Relationship of the Means to the Standard Deviations Between and Within Subjects

                          Roll                              Pitch
              Mean      Standard Deviation      Mean      Standard Deviation
Display       Errors    Between    Within       Errors    Between    Within
                        Men        Man                    Men        Man
New A         24.7      5.1        3.9          20.6      6.1        3.2
Old B         25.2      5.6        3.5          19.8      5.6        3.1
Old C         28.4      7.6        3.3          20.7      5.8        3.7
Old D         29.7      8.5        4.3          22.1      6.7        3.2

Ten subjects (groups c and d, Table 1)
were tested upon each of the two displays
after previous experience with the other, to
satisfy the desire for a “double” experimental
design and to investigate the question of
transfer. Errors were again fewer with display
(A) when first test was compared with
first, and second with second (Table 4).
There is positive transfer from test to test
(Table 5), whichever of the two came first,
but with a difference in degree. Previous
experience with display (D) stood the subjects
in better stead than previous experience with
(A) and this was more marked in the roll
dimension in which the disturbances to be
corrected were the greater. The positive
transfer from (D) to (A) in roll was four
times as great as in the reverse direction, and
in pitch it was in the ratio of 1.6:1. This is
to be expected if the new display (A) is
easier to read than the old (D), and it makes
the point that in this type of problem it is
unsafe, in designing the experiment, to assume
an equal amount of transfer in both
directions.

In the final part of the experiment, two
groups (g) and (h) of ten air pilots who had
about 300 hours experience with the old display
were tested on both displays in alternate
order. Both groups had more men who preferred
the new design (Table 6), but the differences
in performance, while they slightly
favored this design in pitch, were small and
might have been due to chance. It is, perhaps,
noteworthy that with so much previous
practice with the old design (D) they were
not worse than they were when tested with
the new (A).

Table 4

Fewer Errors with One Display after
Experience with the Other

                          Display
Test                Old (D)            New (A)
Sequence            Pitch    Roll      Pitch    Roll
(D) then (A)        20.7     29.2      14.2     17.2
(A) then (D)        16.3     20.8      20.3     23.9

Table 5

Positive but Unequal Transfer between the Displays

Transfer from       Pitch     Roll
(D) to (A)          +6.5      +12.0
(A) to (D)          +4.0      +3.1

Table 6

The Experience, Preference and Performance
of the Trained Subjects

(Rows, for each group of 10 subjects: experience in hours;
number with preference for New (A), Old (D), or neither;
errors in pitch and in roll with New (A) and with Old (D).)
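
The transfer figures of Table 5, and the ratios quoted in the text, can be recovered from the mean errors of Table 4. The sketch below assumes that the first row of Table 4 belongs to the group tested on (D) and then on (A), and the second row to the group tested in the reverse order; the dictionary layout and the function name `transfer` are illustrative only.

```python
# Mean errors from Table 4, keyed by (display, dimension).
# Assumption: first row = group tested on (D) then (A),
# second row = group tested on (A) then (D).
d_then_a = {("D", "pitch"): 20.7, ("D", "roll"): 29.2,
            ("A", "pitch"): 14.2, ("A", "roll"): 17.2}
a_then_d = {("A", "pitch"): 20.3, ("A", "roll"): 23.9,
            ("D", "pitch"): 16.3, ("D", "roll"): 20.8}


def transfer(group, first, second, dimension):
    """Drop in mean errors from a group's first test to its second."""
    return round(group[(first, dimension)] - group[(second, dimension)], 1)


d_to_a_pitch = transfer(d_then_a, "D", "A", "pitch")
d_to_a_roll = transfer(d_then_a, "D", "A", "roll")
a_to_d_pitch = transfer(a_then_d, "A", "D", "pitch")
a_to_d_roll = transfer(a_then_d, "A", "D", "roll")

print(d_to_a_pitch, d_to_a_roll, a_to_d_pitch, a_to_d_roll)
print(d_to_a_roll / a_to_d_roll, d_to_a_pitch / a_to_d_pitch)
```

Read this way, the differences reproduce the entries of Table 5, and their ratios (a little under four in roll, about 1.6 in pitch) agree with the figures given in the text.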


Discussion 


Craik (2) has pointed out that when an 
air pilot has a good unobstructed view of the 
external world, as in day flying, the aircraft 
can be considered to be an extension of his 
body which moves with him and which he can 
orientate in direct reference to the back- 
ground of the external horizon. But if, on 
the other hand, the view is obstructed or ab- 
sent, as at night or in cloud, the interior of 
the machine itself becomes the external en- 
vironment or background, and the pilot is 
then faced with the paradoxical fact that he 
and his background remain relatively fixed 
however he manipulates the controls. Craik, 
therefore, suggested a much larger representa- 
tion of the moving horizon, as in the old atti- 
tude indicator (D), as a way out of this 
situation. However, this may not entirely 
ensure that emergence of figure from ground, 
which forms the essential part of perception 
in this kind of display (Vernon, 4). Where 
the contrast between figure and ground is 
small the results of the present study suggest 
that it does not matter which of the two is 
moving and which fixed. But from the point 
of view of the immediate ad hoc problem of 
whether the aircraft or the horizon should 
move it can be argued a priori that it should 
be the aircraft. According to Rubin’s classifi- 
cation (Woodworth, 5) the aircraft has the 
characteristics of the “figure” rather than of 
the “ground,” because: (1) it has form while 
the horizon bar is relatively formless; (2) 
the aircraft tends to appear in front, the ho- 
rizon behind; and (3) the aircraft is more 
impressive and “more apt to suggest mean- 
ing.” 

In the design of any experiment of this 
kind to investigate two different displays, 
two difficulties have to be considered: (1) 
the comparability of the groups of subjects 
used; and (2) performance transfer, either 
positive or negative, from one test to the 
other. 

If one group of subjects tested on one display
is compared with another group on a
second display, any observed difference may, 
on the face of it, be either an intrinsic func- 
tion of the group or of the display. The use 
of large groups to which the subjects are 
allocated at random after careful matching by 
some analogous test, in theory, helps to en- 
sure their equivalence. But, in practice, the 
matching has to be demonstrably analogous 
to the problem in hand. It is not safe merely 
to assume or to assert this; neither is it easy 
usually to demonstrate this analogy, which at 
best means a lengthy piece of experimental 
work. The alternative design in which one 
half of the subjects is tested on one display 
and then on the other, and the other half of 
the subjects vice versa, can equally well be 
criticized on the ground that the second tests 
are only comparable if the amount of transfer 
is the same in both directions, which may well 
be unlikely (as in the present study), if one 
display is easier to perceive than the other. 
The conclusion seems to be that the experi- 
mental design must be arbitrary to a certain 
extent, and that the most secure design is, 
perhaps, a combination of both these methods. 

In an experiment which is generally com- 
parable to that described here, Loucks (3) 
showed that a display having a reversed sense 
to that of the “old” (D) indicator described 
in this paper produced a greater speed and 
accuracy of response than did one similar to 
the old indicator itself. He was also using 
subjects with no previous experience who ap- 
peared to identify themselves with the moving 
component of the display irrespective of its 
appearance. He suggests that it would be 
even better if the moving component were 
drawn in the shape of a small aircraft. But 
an experiment with this type of display was 
not, in fact, tried, and the present study sug- 
gests that this change might have made little 
difference. However, the numbers of subjects 
used in it were relatively small and the sub- 
ject must still be considered open. It seems 
clear that in order to alter the ease of per- 
ception, figure and ground must contrast in 
qualities other than mere relative movement, 
which alone seems unimportant. 










Summary 


1. The speed and accuracy of human re- 
sponse to two displays which give information 
in two dimensions has been compared. Com- 
parisons of the instruction times needed be- 
fore the test could be started, and of the 
preferences of the subjects, have also been 
made. 

2. The two displays differed in respect of: 
(i) the relation of figure and ground and their 
relative movement; (ii) the damping of the 
oscillations after a given displacement; and 
(iii) their relative complication. 

3. The speed and accuracy of response was 
greater with the more simple display which 
had the heavier damping. This display also 
needed a shorter instruction time and was 
preferred more often by the subjects. Within 
the limits of the experimental design em- 
ployed the pure figure and ground relation 
alone appeared to play little part in percep- 
tion. 

4. The individual differences between subjects
increased as perception became more
difficult. But the differences between different
samples of a single subject’s performance
remained constant.

5. In an experimental design learning trans- 
fer between different tests must not be con- 
sidered to be the same. Neither is matching 
groups of subjects on an assumed analogous 
test experimentally safe. 


Received November 24, 1953. 
Correction